Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Justin's Regex Blog

Thinking in Regex

That range algorithm was more difficult than I originally pointed out due to some strange oddities.

What kinds of strange oddities would I be talking about?  Well, check out the following oddity:

699 through 700
699|700

699 through 701
699|70[01]

699 through 718
699|70\d|71[0-8]

698 through 701
69[89]|70[01]

You start to get the picture.  There are some oddities when the max value has 0's and the min value has 9's that lead the right edge of the ranges.  A standard algorithm as I had outlaid before would do the following:

699 through 700
leftmost = 0
starti = min.Length - 1 = 2

69[9-9]|70[0-0]|6[9-9]\d|7[0-0]\d

You see the problem?  Based on the algorithm of simply finding the leftmost variable character isn't enough, since we also have some dependence on the rightmost character.  There is one solution we can code in, only process a pattern possibility if the character isn't an extreme character.  For the lower bound ranges extremes are 9's and for upper bound extremes are 0.

699 through 700
leftmost = 0, starti =min.Length - 1=2, skip starti = 2 and starti = 1 based on extremes
go to final range processing (step 5).  Since max-min is now only 1, we won't add any patterns.  A null result-set?

What our algorithm really needed to be was 699|700.  However, based on our rules, that won't happen.  Another rule, then might be that we add the pattern if we've approached (i-1) == leftmost.

699 through 708
leftmost = 0, starti = min.Length - 1 = 2
starti = 2, skip 9 in 699, process 8 in 708, results 70[0-8]
starti = 1, skip by default, hit rule (starti-1)==leftmost, same for 708, results 699|708

So we need another rule.  Only process the default pattern if we haven't already processed a pattern.  By that means, the end result is now 699|70[0-8].  Darn, we start to add a large number of rules to the generation of the range alternation group.  There are still some items I haven't fully made elegant yet, as I've recently realized.  Take the following:

699 through 710
699|70\d|710

By our default rule of only processing non-zero digits would exclude us from creating 70\d (actually it wouldn't, since our code would skip it the first time through and then create it the second time through using one of the exceptions).  Since 70\d gets created as an exception 710 never gets created.

I'll leave these thoughts with you.  I'm still waiting for some algorithms to show up!

Sponsor
Published Monday, May 24, 2004 8:51 PM by jrogers
Filed under:

Comments

 

jrogers said:

Is there a reason you care about getting a minimal solution??

Wouldn't a "reasonable" solution be enough?
May 27, 2004 3:11 PM
 

jrogers said:

Think of it this way. If I write an algorithm that doesn't give the minimal solution, then why should I expect the regular expression engine on the back-end to create the minimal solution from what I give it? After all, the same process I am going through to minimize the footprint of the expression should be the same type of minimization process they are doing.

What if they aren't doing a minimization process on the back-end? What if they take my expression as given and treat the entire thing as a normal alternation group, processing each group in turn for a match? In this later case, which is the most likely case given a quick examination of the rotor regex code, my minimal expression will perform better, than a non minimal expression.

So yes, I think the minimal solution is a good goal, as it shows more understanding of the problem set, while a "reasonable" solution only shows that you understand the decomposition of the problem, and not the full problem.
May 27, 2004 3:26 PM
 

jrogers said:

> What if they take my expression as given and treat the entire thing as a normal alternation group, processing each group in turn for a match?

If that is so, you should also order the various alternates by probability of occurrance such that the most-likely alternate is first.
June 6, 2004 8:39 AM
 

TrackBack said:

May 25, 2004 3:52 AM
Anonymous comments are disabled