Justin's Regex Blog

Thinking in Regex

Expressions, Parsers, Validators, User Feedback and the tradeoffs we make.

Although the traffic is mostly local to regexadvice.com/blogs, there has been a lot of examination of various expressions lately, and a date processing expression has become the topic of note. Personally, I have huge problems with constantly having a date expression tossed out as an example. First, the expression is monolithic and, in terms of complexity, is already starting to breach the barrier of usefulness. Second, because of its sheer size and the large amount of time the original author took to create it, working with it to prove or disprove a point is tedious and time consuming, almost not worth it. I'll take a short moment to set in cement some ideas I have about the usefulness of regular expressions as validators.

Expressions:
Let's start with expressions. Expressions, while they appear light on the surface, are nothing more than generic parsing routines. Each element you place in the expression string is translated into string tokenizing and comparison routines. This is no different from writing in a computer language to give the computer instructions as to what to do. Expressions are a language in and of themselves, and the complexity of the underlying implementation simply goes unnoticed. If you've checked any of Darren's or my posts on lexing and parsing, you've gotten a sneak peek at what goes on behind the scenes in a regular expression engine.
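To make that translation concrete, here is a minimal sketch. The pattern is a hypothetical MM/DD/YYYY shape check chosen for illustration (it is not the date expression under discussion), and `matches_by_hand` is an assumed name for the kind of comparison routine an engine effectively runs on your behalf.

```python
import re

# Hypothetical pattern for illustration only:
# two digits, a slash, two digits, a slash, four digits.
DATE_SHAPE = re.compile(r"[0-9]{2}/[0-9]{2}/[0-9]{4}")


def matches_by_hand(text: str) -> bool:
    """Hand-written equivalent of DATE_SHAPE.fullmatch(text)."""
    if len(text) != 10:
        return False
    for index, char in enumerate(text):
        if index in (2, 5):
            # Positions 2 and 5 must hold the literal separators.
            if char != "/":
                return False
        elif char not in "0123456789":
            # Every other position must hold an ASCII digit.
            return False
    return True
```

Both forms accept and reject the same strings; the expression just hides the loop and the comparisons.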

Parsers:
Parsers are hard-coded routines for examining underlying data. If we take the date-time expression, we can convert it into only a few lines of parsing code, definitely fewer than 30 even if we are very verbose. In fact, the final parser program has fewer characters (not lines) than the date expression Michael Ash has posted. His does some extra work though, namely validation, so we'll get to that in a second. Underlying every expression is a highly optimized generic parser, but no generic parser can run at the same speed as an optimized specialized parser. It just can't happen. You can compile expressions under .NET, and in some cases the resulting code is more specialized and runs much faster, but it still follows the syntax and semantics of the expression engine, which may still create some slow-downs.
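As a rough sketch of how small such a specialized parser can be, the following assumes an MM/DD/YYYY input format. The format, `DateParts`, and `parse_date` are all illustrative assumptions, not a port of Michael's expression. It only extracts the fields; validation is deliberately left for a later step.

```python
from typing import NamedTuple, Optional


class DateParts(NamedTuple):
    month: int
    day: int
    year: int


def parse_date(text: str) -> Optional[DateParts]:
    """Split an assumed MM/DD/YYYY string into integers.

    Returns None when the shape is wrong. No range checks happen here.
    """
    fields = text.split("/")
    if len(fields) != 3:
        return None
    # Each field must be a non-empty run of ASCII digits.
    if not all(field.isascii() and field.isdigit() for field in fields):
        return None
    month, day, year = (int(field) for field in fields)
    return DateParts(month, day, year)
```

Note that `parse_date("13/45/2004")` still returns a result, because deciding whether month 13 exists is the validator's job, not the parser's.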

Validators:
After a parser is complete, the data it retrieves needs to be validated in many cases. The date routines have a number of conditionals that reject date ranges that never actually existed, because calendars were changed to reflect new knowledge of the movement of the Earth around the Sun. These validations run as assertions, and as soon as one of them trips, the expression fails and the user gets boolean feedback as to whether or not the date is valid.

Now, how many of you know the list of invalid date combinations between 1 AD and 9999 AD? I can definitely say that I don't. Validators are in place so that once the specialized parser is complete, the values retrieved can be checked against lists or functions of valid or invalid values. The result is to provide feedback to the user in some form or another.
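A minimal sketch of that "lists or functions" idea follows. The names are hypothetical, the leap-year rule is the modern Gregorian one applied to every year (a simplification), and the single skipped range is an assumption based on the first adoption of the Gregorian calendar in October 1582; adoption dates differed by country, so a real list would depend on your locale.

```python
# Assumption: the ten days dropped at the first Gregorian adoption.
# Each entry is an inclusive (start, end) pair of (year, month, day) tuples.
SKIPPED_RANGES = [((1582, 10, 5), (1582, 10, 14))]


def is_leap_year(year: int) -> bool:
    # Modern Gregorian rule, applied to all years for simplicity.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_month(year: int, month: int) -> int:
    if month == 2:
        return 29 if is_leap_year(year) else 28
    return 30 if month in (4, 6, 9, 11) else 31


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Check already-parsed values against functions and a list."""
    if not (1 <= year <= 9999 and 1 <= month <= 12):
        return False
    if not 1 <= day <= days_in_month(year, month):
        return False
    # Tuples compare field by field, so this is a plain range check.
    return not any(
        start <= (year, month, day) <= end for start, end in SKIPPED_RANGES
    )
```

Keeping the exceptions in a list means a newly discovered invalid range is a one-line data change rather than another clause in an expression.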

User Feedback:
The end result of parsing and validation is generally user feedback. The reason I can't stand behind monolithic boolean result expressions is that they don't tell the user what was wrong with the input. If there is a date range in the 1500s that is invalid because of a calendar change, then the user needs to be made aware. By all modern calendar rules the date would be valid, but those dates disappeared from history, and the user needs to know this or else they'll assume something is wrong with your parsing/validation logic.

This brings up two types of validation. The first is the validation you perform on user input. The second is the validation the user performs on your software. If you don't give a reason why the user is wrong, the user will think that you are wrong. They'll think that your stuff is broken. Further, if you don't give real feedback about why the input is invalid, the user can't double check you. After all, if you tell them why they are wrong, but they verify the input is actually valid, then they can notify you of errors in your own program. As Michael has pointed out, he has had many issues with various regular expression engines handling different clauses in different ways. At the end of the day I can't honestly say I'd feel safe using the expressions, because of their complexity, their lack of feedback to the user as to what EXACTLY is wrong with the input date, their dependence on an underlying regular expression engine, their reliance on complex features, and the fact that a specialized parser does the job faster, more accurately, and with more feedback potential.
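To illustrate the difference between a boolean answer and real feedback, here is a small sketch of a validator that returns a reason instead of True or False. The function name and message wording are hypothetical, the October 1582 range is the same assumption about the first Gregorian adoption (other countries switched on other dates), and `calendar.monthrange` from the Python standard library applies modern Gregorian month lengths to every year.

```python
import calendar
from typing import Optional


def explain_invalid_date(year: int, month: int, day: int) -> Optional[str]:
    """Return None for a valid date, otherwise a message naming the problem."""
    if not 1 <= year <= 9999:
        return f"Year {year} is outside the supported range 1 to 9999."
    if not 1 <= month <= 12:
        return f"Month {month} does not exist; use 1 to 12."
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    if not 1 <= day <= last_day:
        return (
            f"{calendar.month_name[month]} {year} has {last_day} days, "
            f"so day {day} does not exist."
        )
    # Assumption: days dropped at the first Gregorian adoption.
    if (1582, 10, 5) <= (year, month, day) <= (1582, 10, 14):
        return (
            "October 5 through 14, 1582 were skipped when the "
            "Gregorian calendar was adopted."
        )
    return None
```

A user who types February 29, 1900 is told that the month has 28 days, which they can check for themselves. If the message is wrong, they have something specific to report back to you.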

Trade-Offs:
What are the trade-offs between parsers and expressions? Well, supposedly expressions are easy to use (not generate, but use), are compact, and are portable between multiple expression systems. However, this isn't necessarily the case. The one thing that expressions do is let the user generate parsers and validators of data without having to understand the underlying technology of parsing and validating data. They are a tool-set and an abstraction.

In the end, the trade-offs are going to be:

  • Speed - The parsing and validation logic is going to run much faster when written in sheer code.
  • Portability - Basic parsing code is extremely portable and can be quickly translated between languages. Expression engines are expected to live up to a spec, but general and widespread implementations often differ in small details.
  • Complexity of Validation - Basic validation in an expression can often require specialized constructs. You often sacrifice the level of validation you can perform.
  • Feedback - This is the most important feature of any validation. You have to give the user proper feedback so they can either fix the string or find errors in your validation logic. Feedback is almost always platform specific and doesn't integrate well with complex validation rule-sets inside of a single regular expression.

Quote of the Post - Use the appropriate tool for the job by identifying solutions that solve all end user and programmer related problems.

Published Tuesday, August 10, 2004 9:04 PM by jrogers