
Justin's Regex Blog

Thinking in Regex

What does conditional matching really mean in a regular expression?

Darren recently posted an example, but even though I'm fairly adept at regex, I didn't find that it conveyed what conditional matching means. He basically shows a way to match Thur in Thursday or Thurday, but if there is an s, to make sure that day also follows. Now I would assert that you can tell what a regex is supposed to do by listing all of the things it matches. It turns out that his expression will match either Thursday, Thurs, or Thur. I don't know what his intention was, but I can talk a bit about what the conditional syntax buys you...

Let's start by taking a look at what might be a possible single level nesting block capture expression:

^\s*({)?(?(1).+?}|.+?$)

This is designed to match two types of statements: single-line statements, which take all of the characters on a single line, and nested blocks that exist between matching French braces. It isn't built for multiple levels of nesting; it is just a basic pattern.

Foo
Bar
Baz
{
    Foo
    Bar
    Baz
}
Foo
Bar
{
    Foo
    Bar
    Baz
}
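To see what the expression does with this input, here is a minimal sketch in Python, whose `re` module happens to support the same `(?(1)yes|no)` syntax. The `re.MULTILINE` and `re.DOTALL` flags are my assumption about the options the expression needs (so that `$` can stop at each line end and `.` can cross line breaks inside a block); the variable names are only for illustration.

```python
import re

# Sample input from above: single-line statements mixed with brace blocks.
SAMPLE = "\n".join([
    "Foo", "Bar", "Baz",
    "{", "    Foo", "    Bar", "    Baz", "}",
    "Foo", "Bar",
    "{", "    Foo", "    Bar", "    Baz", "}",
])

# Assumed options: MULTILINE lets ^ and $ anchor at every line,
# DOTALL lets . cross newlines so a block can span several lines.
CONDITIONAL = re.compile(r"^\s*({)?(?(1).+?}|.+?$)", re.MULTILINE | re.DOTALL)

# Each match is either one line or one whole { ... } block.
statements = [m.group(0) for m in CONDITIONAL.finditer(SAMPLE)]
for s in statements:
    print(repr(s))
```

Under those assumptions the sample yields seven statements: five single lines and two brace blocks, each block captured whole.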

Now this appears to be an excellent reason to use conditional matching. First, we look for an opening brace. If we don't find one then we process one line of input using the else clause. If we do find one then we process the if clause. How would this look as a normal expression though? Can we gain any insight by looking at an expression that does the same thing without the use of the conditional? I sure hope so because it speaks volumes about what a conditional clause really buys you.

^\s*([^{].+?$|{.+?})

Above is one version of a possible expression. We specify the else clause by guaranteeing our condition won't be true: we match any character except an open brace and then go on to match a single-line statement. The if clause is simply rewritten as the second part of the alternation group. If the first character doesn't match the negative character class containing only the open brace, then the open brace must be there and we match it. This just shows off a basic if-then. A basic if-then can always be turned into a two-element alternation group.
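A quick way to check the rewrite is to run both expressions over the same text and compare what they match. The sketch below does that in Python; the flags and the shortened sample are my assumptions, not part of the original expressions. It only shows that the two agree on this input. They can differ elsewhere, for instance because the first alternative needs at least two characters (`[^{]` followed by `.+?`).

```python
import re

# Assumed options: ^/$ anchor per line, and . may cross newlines.
FLAGS = re.MULTILINE | re.DOTALL

conditional = re.compile(r"^\s*({)?(?(1).+?}|.+?$)", FLAGS)
alternation = re.compile(r"^\s*([^{].+?$|{.+?})", FLAGS)

# Shortened sample: two single lines, one block, one more single line.
sample = "Foo\nBar\n{\n    Foo\n    Bar\n}\nBaz"

def statements(pattern, text):
    """Return the full text of every match, in order."""
    return [m.group(0) for m in pattern.finditer(text)]

with_conditional = statements(conditional, sample)
with_alternation = statements(alternation, sample)
print(with_conditional == with_alternation)
```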

You start to see where conditionals are more useful when the space between the match group (the condition) and the conditional expression grows either large or complex. Anything that sits between the condition and the conditional has to be repeated when the expression is rewritten without it. We'll call this the BEGIN..END expression.

^(BEGIN )?[a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\)(?(1) END)\s*$

The BEGIN..END is optional in the expression, but as soon as you add one, you have to also add the other. The END is required whenever the BEGIN is seen. How would you build this requirement in using alternation groups? Well, you have to repeat the middle pattern. The bigger the pattern gets between the condition and the conditional the more pattern you need to repeat. In the above sample this can mean quite a bit.

^(BEGIN [a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\) END|[a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\))\s*$
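As a rough check that the two forms accept the same input, here is a small Python sketch. The patterns are written with single backslashes inside raw strings, and the test strings are made up for illustration.

```python
import re

# Conditional form: END is demanded only when group 1 (BEGIN) took part.
conditional = re.compile(
    r"^(BEGIN )?[a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\)(?(1) END)\s*$"
)

# Alternation form: the middle pattern has to be written out twice.
alternation = re.compile(
    r"^(BEGIN [a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\) END"
    r"|[a-zA-Z0-9][_a-zA-Z0-9]*\([^\)]+\))\s*$"
)

# Hypothetical inputs: the first two are well formed, the last two
# have a BEGIN without an END or an END without a BEGIN.
CASES = ["foo(bar)", "BEGIN foo(bar) END", "BEGIN foo(bar)", "foo(bar) END"]

results = {
    text: (bool(conditional.match(text)), bool(alternation.match(text)))
    for text in CASES
}
for text, (cond_ok, alt_ok) in results.items():
    print(f"{text!r}: conditional={cond_ok} alternation={alt_ok}")
```

Both forms accept the two well-formed inputs and reject the unbalanced ones; the only difference is how much of the pattern had to be typed twice.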

That makes conditionals pretty darn nice in my book, since they save you a bunch of typing. Using the same conditional multiple times still reduces to a single alternation group, so it adds no extra work in this scenario. Adding more than one conditional, however, has a very negative impact on the final expression: the number of alternations grows either linearly or exponentially, depending on how the conditionals are organized. Maybe I'll get into that complexity another time. A quick summary follows:

  • Conditionals allow you to select between up to two patterns based on the match success of previous patterns.
  • Conditionals can be dependent on either a previous numbered or named match group or on a pattern that is specified as the conditional.
  • A single conditional can be rewritten as an alternation group with two patterns.
    • The true or left pattern is normally a combination of the optional match group, the intermediate pattern, and the if clause
    • The false or right pattern is normally a combination of the intermediate pattern and the else clause
  • A single conditional used multiple times in a pattern can still be rewritten as an alternation group with two patterns
  • Multiple conditionals in the same pattern can increase the number of alternations linearly or exponentially or a combination of both
    • A conditional with a second conditional in the if clause increases linearly from 2 to 3 alternations
    • A pattern with two separate conditionals increases exponentially from 2 to 4 alternations (think binary truth tables)
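The last two points can be sketched with a made-up example: a word that may be wrapped in `<...>`, in quotes, or both. The patterns and inputs below are hypothetical and written for Python's `re` module; they only illustrate the 2-to-4 and 2-to-3 growth in alternations, not any expression from the discussion above.

```python
import re

# Two separate conditionals, one per optional opener.
separate = re.compile(r'^(<)?(")?\w+(?(2)")(?(1)>)$')
# Spelled out by hand: 2 x 2 = 4 alternatives (a binary truth table).
separate_alt = re.compile(r'^(?:<"\w+">|<\w+>|"\w+"|\w+)$')

# Second conditional nested in the if clause of the first:
# the closing quote is only checked when < was seen.
nested = re.compile(r'^(<)?(")?\w+(?(1)(?(2)")>)$')
# Three alternatives: both true, outer true only, outer false.
nested_alt = re.compile(r'^(?:<"\w+">|<\w+>|"?\w+)$')

INPUTS = ['x', '"x"', '<x>', '<"x">', '<x', 'x>', '"x', '<"x>', '<x">']

def accepted(pattern):
    """Return the inputs the pattern matches, in INPUTS order."""
    return [s for s in INPUTS if pattern.match(s)]

print(accepted(separate) == accepted(separate_alt))
print(accepted(nested) == accepted(nested_alt))
```

Each conditional form accepts the same inputs as its hand-expanded alternation, which is where the 4 and the 3 come from.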

 

Published Tuesday, August 10, 2004 2:02 AM by jrogers

Comments

 

jrogers said:

Good stuff. This certainly clears up my idea of what a regex condition is and how it should be used.

I'm looking forward to reading more...
March 4, 2005 12:09 PM
 
