
Justin's Regex Blog

Thinking in Regex

Examining the data flow model of using Regular Expressions for Dates

I think some diagrams are going to be extremely important for examining the problems of using regular expressions to validate and parse dates, especially as you start to rely on more complex and time-consuming operations to validate additional constraints. The following diagram is what I call the Expression Data Flow Model for Dates. It isn't complex; it just shows the abstract processing from user input all the way through to the final result. I'll use the various blocks to talk about each step in turn. I've also provided a sample run with a sample input and expression, along with the result and the processing of the output.

 

User Input:
This is where the user sends data into your data flow pipeline. By using a regular expression you limit yourself to processing string input. This isn't all that bad, since most date and time input on the web is done through a text field. More recently, interfaces have moved toward giving the user a calendar control. In that scenario you can toss the expression out altogether and process the separate UI components of the calendar control directly.

You'll see later how separating the parser/validator logic enables us to short-circuit several of the steps in the data flow model.

Expression Block:
The expression is a parser, since it splits the input string into portions and places them into groups (aka variables or fields). It is also a validator, because it verifies that specific characters are digits, and possibly digits within specific ranges. It also verifies digit counts, separator characters, and so on. If you want additional validation logic, it has to be placed into the expression itself. Unless you design the expression ingeniously, the output is going to be very fixed with respect to the amount of information you get back.
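
To make the dual role concrete, here is a minimal Python sketch. The MM/DD/YYYY format, the pattern, and the name `DATE_RE` are assumptions made for this example; they are not the expression from the sample run.

```python
import re

# Hypothetical pattern for MM/DD/YYYY, assumed for illustration only.
# Parsing: three capture groups split the input into month, day, and year.
# Validation: digit counts, numeric ranges, and the "/" separator are all
# enforced inside the pattern itself.
DATE_RE = re.compile(
    r"(0[1-9]|1[0-2])"          # group 1: month, 01 through 12
    r"/"
    r"(0[1-9]|[12]\d|3[01])"    # group 2: day, 01 through 31
    r"/"
    r"(\d{4})"                  # group 3: four-digit year
)

match = DATE_RE.fullmatch("08/14/2004")
month, day, year = match.groups()

# The range checks know nothing about the calendar, so this still matches.
feb_31 = DATE_RE.fullmatch("02/31/2004")
```

Rejecting an input like 02/31/2004 would mean pushing month-length (and leap year) logic into the pattern as well, which is exactly the kind of growth described above.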

Result:
Once the expression has run you get a single boolean result, true or false, telling you whether the input was successfully parsed AND validated. You don't know what part of the process failed, only that it failed. No groups are partially filled, so you can't tell that it failed at a specific location in the expression (without ingeniously engineering the expression).

If the pattern does succeed then you have a number of groups or captures that identify the results. Notice that the entire match is stored in group 0 while your capture groups are stored in 1 through 3. The problem you'll face is that you've ALREADY validated the digits using the expression. You have basically performed more than half the work of converting the digits into an actual strongly typed piece of data, but each field is STILL returned to you as a string. If you want to continue moving forward and working with the fields themselves you have to do some more processing.
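
A short Python sketch of what comes back from a match, again using a hypothetical MM/DD/YYYY pattern (an assumption, not the expression from the sample run). Python reports failure as `None` rather than false, but the effect is the same.

```python
import re

# Hypothetical MM/DD/YYYY pattern, assumed for this example.
PATTERN = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(\d{4})")

# Two inputs that fail for different reasons give the same answer: None.
bad_separator = PATTERN.fullmatch("08-14-2004")   # wrong separator character
bad_month = PATTERN.fullmatch("13/14/2004")       # month out of range

# A successful match: group 0 is the whole match, groups 1-3 are the fields.
ok = PATTERN.fullmatch("08/14/2004")
whole = ok.group(0)
fields = ok.groups()

# Every field is still a string, even though its digits were already checked.
field_types = [type(f).__name__ for f in fields]
```

Nothing in either failed result says which part of the pattern rejected the input, and nothing in the successful one is typed as a number.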

The point I have been trying to make with user feedback about expressions comes into play here. You'll notice that, because the parser and validator are a black box, we can only tell the user whether the pattern succeeded or not. We have no clue why the input failed. We can't tell whether it failed for a critical reason like an invalid pattern (failed parsing) or for a data and logic reason (a date that doesn't exist in history). Imagine, for instance, that you started a site that published the history of the world and the user is inputting dates to try and find data. On your site, some dates just won't go in and the user can't figure out why. They go to Google and search for the same date and magically results appear. For dates that don't exist there are sure a lot of pages with information that happened on those dates. The reason? Well, not everyone in history gave a hoot that the calendar was changing, and much of written history is overlapped with stuff that happened on dates that technically didn't exist.

Processing:
At the end of the data flow model you can't just keep a date around in its original string form. A simple date with a 4 digit year takes 10 characters: 20 bytes as a two-byte-per-character Unicode string, or 10 bytes on a non-Unicode system. However, you really don't need that much space for a date. Without an associated time it will easily encode into a single integer. If you start parsing times then you'll need up to 8 bytes. The nice thing about the 8 byte version is that it is fully parsed and encoded. The information is retrieved using a series of quick mathematical transforms rather than by reparsing the string.
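
The size comparison can be sketched in Python. The encodings chosen here (a day count packed into 4 bytes, and a microsecond count packed into 8 bytes) are assumptions picked for illustration; other platforms use their own epochs and units.

```python
import struct
from datetime import date, datetime, timedelta

text = "08/14/2004"
utf16_size = len(text.encode("utf-16-le"))   # two bytes per character
ascii_size = len(text.encode("ascii"))       # one byte per character

# A date without a time fits in one integer: days counted from 0001-01-01.
d = date(2004, 8, 14)
ordinal = d.toordinal()
packed_date = struct.pack("<i", ordinal)     # 4-byte signed integer

# Working on the encoded form is plain arithmetic, with no string reparsing.
thirty_days_later = date.fromordinal(ordinal + 30)

# A date with a time fits in 8 bytes: microseconds counted from 0001-01-01.
EPOCH = datetime(1, 1, 1)
dt = datetime(2004, 8, 14, 16, 18)
micros = (dt - EPOCH) // timedelta(microseconds=1)
packed_datetime = struct.pack("<q", micros)  # 8-byte signed integer

# Decoding is again arithmetic on the integer.
restored = EPOCH + timedelta(microseconds=struct.unpack("<q", packed_datetime)[0])
```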

If you plan on using the date then you need it in some form where you can work on the members. After all, you can't easily add days, months, or years, or do any sort of transformation, on results that are still strings captured by the expression. Some users may point out that JScript allows you to work with the groups as if the underlying capture were actually an integer, but that overlooks the fact that every time you do so a conversion is being made on your behalf. That means that after validating the data once you have to validate it again during conversion, since a conversion is nothing more than a validation of the underlying string to see if it has an associated strongly typed integer representation. Just doesn't seem to make much sense, does it?
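
Here is a minimal Python sketch of that double work. The pattern and the helper name `to_date` are hypothetical, introduced only for this example.

```python
import re
from datetime import date

# Hypothetical MM/DD/YYYY pattern, assumed for this example.
PATTERN = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(\d{4})")

def to_date(text):
    """Return a date, or None if the text fails the pattern or the calendar."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None  # first validation pass: the expression itself
    # Second pass: int() scans each digit string again, even though the
    # expression has already proven that every character is a digit.
    month, day, year = (int(g) for g in m.groups())
    try:
        # Third pass: the date constructor checks the ranges once more and
        # finally applies the calendar rules the pattern could not express.
        return date(year, month, day)
    except ValueError:
        return None

parsed = to_date("08/14/2004")
rejected_by_calendar = to_date("02/31/2004")
rejected_by_pattern = to_date("8/14/2004")
```

The caller still receives the same `None` for both kinds of failure, so the extra passes cost time without buying any extra feedback.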

Conclusion:
The data flow model for parsing dates with expressions is bleak because you wind up spending a large amount of time parsing and validating inside of the expression itself only to do a large amount of additional processing after the fact. Michael has pointed out that existing methods for parsing often don't give much feedback either. With that in mind, we'll examine the data flow model and the additional feedback provided by using an existing parser with a supplementary validator. You'll see how converting data as you parse it prevents extra processing later and how deconstructing the black-box of the parser/validator that is a regular expression will enable you to start providing more feedback.

Published Saturday, August 14, 2004 4:18 PM by jrogers