Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Justin's Regex Blog

Thinking in Regex

0 through N ranges are interesting, but what about M through N ranges.

You can read about the decomposition of 0 through N ranges here Value range parsing using Regular Expressions. Breaking down the dotted decimal IP byte processing.  Make sure to read the comments, because there is a way to use even fewer decompositions than I listed in the article through the use of optional captures.

Now the trick is examining ranges with both upper and lower bounds.  This is an empirical review as I'll take several ranges and decompose them, but I won't necessarily explain the mathematics for finding the count of the decompositions.  We'll start with the decomposition of 222-555, which requires 5 decompositions to yield all results.

22[2-9]
2[3-9]\d
[3-4]\d\d
5[0-4]\d
55[0-5]

The next process will yield the range validation of 22-555.  The difference here is that instead of having all numbers in the range of the same length, we now have numbers of varying lengths.  The decomposition doesn't change much, and is equivalent to the decomposition of the range 222-555, except that the lower bound number of 22 is now equivalent to 022-555 (this allows our decomposition to yield the same results).

2[2-9]
[3-9]\d
[1-4]\d\d
5[0-4]\d
55[0-5]

You may notice a simple pattern between the two.  There is a symmetry between where the character classes are used to denote custom ranges of digits and the static captures (both \d and non custom character classes are considered static captures for the purpose of this example).  There is definitely a mathematical relationship between the layout of the ranges, and I could probably spend 10 or so pages explaining it, but I'll leave it out.  The point is there is also an algorithmic association between the numbers that recursively works from the right side of the first range back through to the left and back to the right of the second range.

First, sort the ranges by min/max.  This is important.  Second, determine the first non matching character in the ranges starting from the left.

1.  222 and 555 are sorted thusly: min(222), max(555)
2.  compare 2 to 5, since they are unmatched, then we start matching there.

That second step is important.  Say you were sorting 222 and 228, you an see that the first non-matching character is the ones place.  That means the left-side of the range doesn't matter for the purposes of generating our ranges.  The range match simply becomes 22[2-8], and there is only a single decomposition.  That means our decomposition is only based on the difference in the right side of the equation or rather based on the left-most digit that doesn't match.  Any missing left side digits should be replaced with 0's and used in this process.

The third step is to build range groups from the digit value to 9 as you walk the first string from right to left.  The fourth is to do the same for the second string only this time you build from 0 up to digit value.  This results in (digit-1)*2 decompositions.  The final decomposition or fifth step is to build the middle range.

3.  22[2-9], 2[3-9]\d
4.  55[0-5], 5[0-4]\d
5.  [3-4]\d\d

[Editorial Note: I just drank some death sauce and I'm experiencing a slight hot twinge in my mouth]

The fifth step is dependent upon the relationship between the upper and lower bound numbers.  If the difference between the two numbers is less than 2, then you leave out the final step, and it isn't needed.  If the difference is greater than 2, then you subtract 1 from the upper and add 1 to the lower to get your range.

No refactoring, no fluff, just a quick range creation algorithm for regular expressions.  I challenge you to convert the algorithm into code.  There are several ways to implement each of the steps, and maybe even some more interesting algorithms.

Sponsor
Published Saturday, May 22, 2004 6:40 PM by jrogers
Filed under:

Comments

 

TrackBack said:

May 22, 2004 10:44 PM
 

TrackBack said:

May 25, 2004 3:52 AM
Anonymous comments are disabled