Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Justin's Regex Blog

Thinking in Regex

Performance: Character Classes versus Alternation Groups

If heated discussion with Michael Ash keeps resulting in this sort of performance break-through then I'll argue with him every day of the week. To follow the performance examinations you are going to have to make sure that both Compiled and ExplicitCapture are on. While it might be nice to also check the interpreted speed, I can imagine that when running in interpreted mode, most of the speed enhancements of the alternation group are completely lost.

Let's start the process by examining the process of matching the inputs of 10 and 12. First we'll do what I call finger matching using a character class. The expression we'll use is 1[02]. Finger matching looks something like this, though having some images would be better, I don't have the time to make them.

Op #: Description
Op 1: Input[0] = Pattern[0]; // Input = “11” for now
Op 2: Fail or Scan; // Fail on no equal, Scan next char on equals
Op 3: Setup Character Class Match; // Here we do math to set up a binary search
Op 4: Input[1] = CharClass[1]; // Because we have two elements 0 and 1, we'll try to match first 1 then 0
Op 5: Succeed or Input[1] = CharClass[2];
Op 6: Succeed or Fail;

Now, don't let the number of operations confuse you. Just because the character class matching looks really fast, it involves setting up a call stack, calling a method, scanning through a string representing the character class itself, and finally returning some true/false value. That can be costly, especially compared to branch and comparison statements. That puts us at an alternation group that does the same thing. It looks something like this 1(0|2). Remember that explicit capture is on. If not, you'll kill your perf. If you don't want explicit capture then at least prevent capture using 1(?:0|2):

Op 1: Input[0] = Pattern[0];
Op 2: Fail or Scan
Op 3: Input[1] = Alt[0][0]; // First alternation group, first check
Op 4: Succeed or Input[1] = Alt[0][1]; // We reach the comparison or success through a branch statement. Fast!
Op 5: Succeed or Fail

There is only one less conceptual operation, but several hundred less machine operations. We find that in the compiled scenarios running a worst case scenario match (aka a match that won't be found in the alternation group or character class) will examine the performance potentials. Here is the code for running against the digit class and the lower case character class (note that going against the lower case word character class is even slower because of Unicode support, so we use an explicit character class).

Regex regex23 = new Regex("^1\\d$", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
Regex regex24 = new Regex("^1(0|1|2|3|4|5|6|7|8|9)$", RegexOptions.Compiled | RegexOptions.ExplicitCapture);

Regex regex25 = new Regex("^a[a-z]$", RegexOptions.Compiled | RegexOptions.ExplicitCapture);
Regex regex26 = new Regex("^a(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)$", RegexOptions.Compiled | RegexOptions.ExplicitCapture);

If you are going to run these under a performance scenario then the use string “1a” for the first pattern. This will match the start of the pattern, and then fail matching the remaining required digit. For the alternation group this will mean 10 branch comparisons. For the character class it means 3 comparisons and the setup of the call-stack. For the other pattern, use “a1” and the same thing will happen. Run them once outside of the loop to get the grease rolling. This makes sure all of the methods are JIT'ed. This is important. Lots of people don't prime the pump, but you need to make sure and call every method at least once and that means possibly exercising hundreds of code-paths in complex scenarios to make sure the perf results don't include JIT time.

// Compile the Shiznut
regex23.IsMatch(Data11);
regex24.IsMatch(Data11);
regex25.IsMatch(Data12);
regex26.IsMatch(Data12);

start = DateTime.Now;
for(int i = 0; i < 10000000; i++) {
    regex23.IsMatch(Data11);
}
end = DateTime.Now;
Console.WriteLine("Character Class [0-9] Elems {0}", end - start);

start = DateTime.Now;
for(int i = 0; i < 10000000; i++) {
    regex24.IsMatch(Data11);
}
end = DateTime.Now;
Console.WriteLine("Alternation Group (0|1|2|3|4|5|6|7|8|9) {0}", end - start);

start = DateTime.Now;
for(int i = 0; i < 10000000; i++) {
    regex25.IsMatch(Data12);
}
end = DateTime.Now;
Console.WriteLine("Character Class [a-z] Elems {0}", end - start);

start = DateTime.Now;
for(int i = 0; i < 10000000; i++) {
    regex26.IsMatch(Data12);
}
end = DateTime.Now;
Console.WriteLine("Alternation Group (a|b|...|y|z) {0}", end - start);

Just a basic set of loops. We are using IsMatch here rather than Match. Match creates some additional objects on the stack and we want to minimize the amount of memory thrash. The results are interesting to say the least. I am displaying the closest set of runs that I was able to capture giving the character class the benefit of the doubt. In my testings the character classes were very iffy in their results and timings fluctuate wildly. For the alternation group the timings are much more stable. I imagine the extra stack thrashing for character classes might have something to do with this timing difference.

Character Class [0-9] Elems 00:00:02.5636864
Alternation Group (0|1|2|3|4|5|6|7|8|9) 00:00:02.3533840
Character Class [a-z] Elems 00:00:02.4334992
Alternation Group (a|b|...|y|z) 00:00:02.3834272

Now with this knowledge, that alternation groups are somewhat faster, you can do some more work to get even better results. For instance, you can order the precedence of characters in the alternation group to more closely resemble the density of characters that you are matching against. For the english language this might mean matching vowels in the beginning. You'll get huge perf wins whenever the alternation group has a character matching a character in the input, but still nothing when the character is outside of the range.

There is one note. A huge peformance gain exists that isn't being taken advantage of. A linear character class such as [a-z] can be turned into two boundary checks. This would certainly save a lot of time. A complex class such as [0-9a-zA-Z] could be turned into 3 such boundary checks. Maybe Kit will grace us with a view and put this on a possible future performance improvement list. While this won't work for complex character classes especially those that are unicode aware, it does work for user specified classes.

Published Saturday, August 14, 2004 3:53 AM by jrogers
Filed under:

Comments

No Comments
Anonymous comments are disabled