Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Understanding backtracking

Last post 07-27-2010, 12:47 AM by regexfan. 3 replies.
Sort Posts: Previous Next
  •  07-23-2010, 8:40 AM 69940

    Understanding backtracking

    Hi,

    Appreciate if somebody could help me understand the backtracking behavior of the following two regexps.

    Test String: mycar car

    regexp A: \b(\w+)\s+\1\b

    regexp B: \b(\w+\b)\s*\1

    Both fail to match the given test subject and thats fine, I anticipated that and now I'm analyzing the behavior (to learn differences between the two regexps).

    RegExp A backtracks to  (\w+soon after  \1  failing to match the letter  c  (in car) and ends parsing the string at that point.

    ie. (\w+)  doesnt seem to give back the alpha-numerics it consumed giving back till it reaches m, before calling quit. I expected  (\w+)  to give back till it reaches  m  as  (\w+)  needs only a single alpha-numeric to satisfy itself.

    RegExp B too backtracks to  (\w+)  after  \1  failing to match the letter  c  BUT it doesnt terminate the process there.

    (\w+)   continues to give back characters it continued till it reaches the ltter  m  before giving up matching.

    Question: It appears that    (\w+\b)      makes the regex engine to continue backtracking to the last letter   \w+   consumed while     (\w+)    simply doesnt give back anymore and stops right there.

    I'm using RegExBuddy to debug the expressions.

    Appreciate if somebody could help.

    TIA,

    RegExFan.

    P.S. regex flavor = .NET

  •  07-24-2010, 7:11 AM 70010 in reply to 69940

    Re: Understanding backtracking

    Hi Again,

    I analyzed the regexps again, this time with a different tool - RegExCoach and it appears that the regex engine behind the RegExCoach in fact forces (\w+) of RegExp A to give back up to the letter  m, before calling quit, similar to the (\w+\b) of RegExp B, which is NOT what is indicated by the RegExBuddy.

    According to the RegExBuddy, RegExp B fails to match the test string after 17 steps while RegExp A fails after 6 steps and the reason seem to be hidden (*mainly but not solely*) behind how the engine backtracks to the groups   (\w+\b)    and   (\w+).

    Is this difference due to some internal optimizations applied by the RegExBuddy engine?

    Your thoughts are appreciated. 

    TIA,

    RegExFan.

    P.S. It appears that RegExCoach is a regex-directed engine.

  •  07-25-2010, 8:35 PM 70077 in reply to 69940

    Re: Understanding backtracking

    I'm not able to explain how the individual regex variants behave, but you may well be right in that they apply varying levels of optimisation.

    Basically what happens is that both patterns force the first work to be anchored in some way at both ends. The initial '\b' means that the match must begin at a work boundary - either the beginning or end of a work. Therefore the only possible places where a match can start within "mycar car" is with the first character "m", the first "r" before the space, the "c" of the 2nd word and the last "r" in the string.

    The next pattern operator is a '\w' which means that there are only 2 places where  a match can start - the "m" of "mycar" and the "c" of "car".

    Taking the first of these, both regex patterns will initially match the same characters because the '\w+' will match all of "mycar" and stop before the space and this is the captured text. Because there is a space character in the pattern, the additional '\b' in the second pattern is not really relevant.

    Both patterns will then match the space and try to match the captured text and both fail.

    Both pattern will then try to backtrack by releasing the last character captured in group #1 - this is the "r" but both will immediately fail, the first because it requires a space character to be matched next (and the "r" is not a space) and the second because it requires a word boundary but has the "a" before it and the "r" after it. Both will continue to backtrack until all of the matched characters are returned and the pattern as a whole fails.

    I'm not sure how you have determined how each pattern is actually processed but, as far as I can see, both should be almost identical unless the regex engine is applying some optimisation such as "knowing" that any character captured by the '\w+' and released by the backtracking cannot then form a word boundary (which by definition is either the point between '\W\w' or '\w\W') and so skips the intermediate attempts.

    Also, what is your purpose in analysing this? The patterns (to me) seem to be too similar to see any differences, especially as you seem to be using the same regex variant?

    Susan 

  •  07-27-2010, 12:47 AM 70144 in reply to 70077

    Re: Understanding backtracking

    Hi Susan,

    First of all thank you for your clear and descriptive comments.

    As to the Qs you have raised: 

    (a) How did I determined the way engines processed the text? I analyzed the patterns using RegExBuddy and RegExCoach. RegExBuddy facilitates debugging a regexp by showing step by step how it matched the text with a given regexp. RegExCoach has this cool feature that walks you through step by step in the matching process. My understanding is "what they show is what they do", may be am wrong!

    (b) Why I go about analyzing these two patterns? Solely for learning purposes. Since RegExBuddy showed a different output (processing) for the two expressions I wanted to understand the reasons behind the difference.

    Thanks once again. 

    RegExFan

View as RSS news feed in XML