
Michael Ash's Regex Blog

Regex Musings

You've got your sub-matches in my matches

Hello boys and girls. Wow, it's been a while since I've done this. I want to touch on a very useful but often overlooked feature of regex: grouping. While I haven't been blogging, I have been active on a message board here or there. A question I see quite often is “I want to find a match in a string but I don't want part of the match” or “I need the value of this portion of the string.” Now, often I see solutions to these types of questions that involve look-arounds. And while they certainly work, they aren't the only way to achieve the desired results. New regex users seem to believe they can only access the full match. Most regex engines support groups, which let you access a certain portion of the full match. Groups are identified by parentheses. Every pair of plain parentheses is a group. For example, take the regex pattern


/(Hello)\x20(world)/


There are two groups in the regex. The regex itself matches the string “Hello world”, group 1 contains the string “Hello”, and group 2 contains the string “world”. Note that neither group contains the space between the two words, but it is part of the full match. Most implementations of regular expressions give you a way to access these groups. They are usually contained in a collection inside the match object. You'll need to consult your regex documentation to know the exact layout, but in most of these implementations there is a zero-based collection where item 0 is the full match and items 1..n are the groups your regex contains, if any.
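To make the layout concrete, here is a minimal sketch using Python's `re` module. Python is just my choice for illustration; the sample input string and the variable names are made up for this example, and your own engine may expose the collection differently.

```python
import re

# \x20 is the hex escape for a space, same as in the pattern above.
pattern = re.compile(r"(Hello)\x20(world)")

# Hypothetical input; the pattern matches in the middle of it.
match = pattern.search("Say Hello world to everyone")

full_match = match.group(0)   # item 0: the entire match, space included
first = match.group(1)        # first pair of parentheses
second = match.group(2)       # second pair of parentheses
all_groups = match.groups()   # tuple of groups 1..n (group 0 is left out)

print(full_match)
print(first, second)
print(all_groups)
```

In Python, `group(0)` plays the role of item 0, while `groups()` returns only the sub-matches.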


Now if you notice, I said every pair of plain parentheses is a group. The reason I stress plain is because there are other group constructs, the aforementioned look-arounds being some. Support for the other constructs varies between implementations, so again consult your regex documentation to see which ones you have. The other grouping constructs consist of an open parenthesis immediately followed by a question mark, which is then followed by the characters that define that particular grouping construct. Again, consult your documentation to see which characters define what. I'm not going to go into all of them here, but they basically fall into two categories: capturing and non-capturing. The plain parentheses I've mentioned above are a capturing group. However, capturing requires extra resources, and in some cases you need the extra speed but still need to group a certain part of the pattern together, for necessity, readability, or both. This is where you'll want to use a non-capturing group, (?:pattern): an open parenthesis followed immediately by a question mark followed immediately by a colon. The difference here is that the data matched in the group is not added to the collection of sub-matches in the match object.

Taking our previous example and making the first group non-capturing:

/(?:Hello)\x20(world)/


Where before we had two groups, here we only have one. Group 1 contains the string “world”. Now, this example is not a very practical use of a non-capturing group. Typically you'd use them in more complex regexes that need grouping but where you really don't care about the sub-matches.
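A quick way to see the effect, again sketched in Python's `re` module as an illustration (the variable names are mine):

```python
import re

# Same pattern as above: the first group is non-capturing.
match = re.search(r"(?:Hello)\x20(world)", "Hello world")

full_match = match.group(0)     # still the whole "Hello world"
groups = match.groups()         # only the capturing group shows up
group_count = match.re.groups   # number of capturing groups in the pattern

print(full_match, groups, group_count)
```

The full match is unchanged; only the sub-match collection shrinks.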


One more quick thing about capturing groups: basically, each left parenthesis determines the index of the sub-match in the groups collection. So if you have nested parentheses, count every (plain) left parenthesis from left to right to know which index to use to reference it. Some of the advanced grouping constructs and regex options can affect the ordering, but if you are using them hopefully you've read up on their effects, so I won't go over that here.


/((Hello)|(Goodbye Cruel))\x20(world)/


The above regex has 4 capturing groups (not counting group 0). Can you find them? It should match either the string “Hello world” or “Goodbye Cruel world”. Now I want to point out that not all the groups will participate in the match, but they are still part of the groups collection. There will always be 4 groups; one of them will always be empty. Which one depends on which string was matched.

If “Hello world” was matched the groups are

  1. Hello

  2. Hello

  3. (empty)

  4. world


If “Goodbye Cruel world” was matched the groups are

  1. Goodbye Cruel

  2. (empty)

  3. Goodbye Cruel

  4. world


In both cases, group 0 would be the full match.
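The two listings can be checked with a short sketch in Python's `re` module (my choice of language for illustration). One engine-specific detail to be aware of: Python reports a group that did not participate as `None` rather than as an empty string, so "(empty)" above shows up as `None` here. Other engines may behave differently.

```python
import re

pattern = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")

# groups() returns groups 1..4; group 0 is not included.
hello_groups = pattern.search("Hello world").groups()
goodbye_groups = pattern.search("Goodbye Cruel world").groups()

print(pattern.groups)    # number of capturing groups in the pattern
print(hello_groups)
print(goodbye_groups)
```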


If you note, in both cases two groups contain the same value. Even if you need to know whether “Hello” or “Goodbye Cruel” was matched, you certainly don't need to know it twice. Plus, the inner parentheses have different indexes, so you'd have to check both if you want to use those. This is where you'd use the non-capturing group to simplify your groups collection.


/((?:Hello)|(?:Goodbye Cruel))\x20(world)/


Now we are back down to two groups. Group 1 contains either “Hello” or “Goodbye Cruel”, depending on which string was matched. Group 2 always contains “world”.
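Sketched in Python's `re` module (an illustration only, with variable names of my own), the simplified collection looks like this:

```python
import re

# Inner alternatives are non-capturing; only the outer group and "world" capture.
pattern = re.compile(r"((?:Hello)|(?:Goodbye Cruel))\x20(world)")

hello_groups = pattern.search("Hello world").groups()
goodbye_groups = pattern.search("Goodbye Cruel world").groups()

print(hello_groups)
print(goodbye_groups)
```

No group is ever left unfilled now, so there is nothing to check before reading group 1.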


However, keep in mind that in some cases you'll want to use the inner index to determine which group was matched. So using non-capturing groups isn't necessarily better; it just depends on whether you need to access those groups or not. But if you are not doing anything with them, don't capture them.
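Here is one way that inner-index check might look, as a rough sketch in Python's `re` module. The helper name `which_alternative` and its return labels are hypothetical, invented for this example. It relies on Python reporting a non-participating group as `None`.

```python
import re

# Keep the inner groups capturing so their indexes tell us which branch matched.
pattern = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")

def which_alternative(text):
    """Return a label for the branch that matched, or None if nothing matched."""
    m = pattern.search(text)
    if m is None:
        return None
    # Group 2 only participates when the "Hello" branch matched.
    if m.group(2) is not None:
        return "hello"
    # Otherwise group 3 ("Goodbye Cruel") must have participated.
    return "goodbye"

print(which_alternative("Hello world"))
print(which_alternative("Goodbye Cruel world"))
print(which_alternative("Hi there"))
```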



These are just two of the basic grouping constructs, and they are generally supported across implementations of regex, but not always. Where they are supported, you can use them to easily dissect larger matches.



Published Friday, June 01, 2007 11:47 AM by mash