Wayne's Regex Rants

  • Why look around?

    There have been a number of posts and replies about 'killer' reasons for lookaround (starting with Darren's query), and I want to drop a few thoughts into the fray.

    1. To do the work of multiple regexes within a single regex.  Michael Ash and Justin both hinted at this, demonstrating it without explicitly calling it out as such.  A fine example is a password validation regex, such as “verify at least 8 characters, including at least one digit, one letter, and one symbol”.  Use lookaround to test each independent and possibly overlapping requirement.  The length requirement needs to look at every single character, so it overlaps with every other requirement; the one-character requirements overlap with each other because they may appear in any order.

    One way:

    (?=.*\d)(?=.*[a-z])(?=.*[!@#$%]).{8,}

    Alas, this isn't a 'killer' reason and I concede it is possible to compose an equivalent pattern without the use of lookahead, but it would be insane.
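
    As a rough sketch, the check could be wrapped in C# like this.  The names PasswordCheck and IsValid are hypothetical, the length test is written as .{8,}, and the leading ^ is my own assumption so that the lookaheads are evaluated once from the start of the input rather than retried at every position:

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper; not part of any library.
    public static class PasswordCheck
    {
        // Each lookahead tests one requirement without consuming input,
        // so all three start from the same position and may overlap freely.
        // Note that [a-z] as written only accepts a lowercase letter.
        private static readonly Regex Rex = new Regex(
            @"^(?=.*\d)(?=.*[a-z])(?=.*[!@#$%]).{8,}");

        public static bool IsValid(string candidate)
        {
            return Rex.IsMatch(candidate);
        }
    }
    ```

    With this sketch, PasswordCheck.IsValid("abc123!x") passes, while a candidate missing the digit, the letter, the symbol, or the eighth character fails.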

    2. To force a lazy match to occur.  Michael Ash hinted at this in his mention of entity-replacement.  Given input like:

    <tag1> lorem ipsum <a href=a>alpha</a> <a href=b>beta</a>
    <tag2> lorem ipsum <a href=z>zed</a>   <a href=y>why</a>
    where the number of links may vary (zero or more), acquire the text of each link, but only for the set of links that follows each <tag>.

    Here's the pattern I used:
    <tag(?'tag'\d) .*? (?=<a|<tag) (<a[^>]*> (?'text'[^<]*) </a> \s*)*
    
    RegexOptions.Singleline, IgnorePatternWhitespace, ExplicitCapture
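
    Here is a minimal sketch of how the link text might be read back out in C#.  The names TagLinks and Extract are hypothetical.  The sketch assumes the named group is written (?'text'...) and that a \s* inside the repeated group skips the whitespace between adjacent links, since IgnorePatternWhitespace drops the literal spaces from the pattern.  Each repetition of the group adds one entry to Group.Captures:

    ```csharp
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Hypothetical helper; not part of any library.
    public static class TagLinks
    {
        private static readonly Regex Rex = new Regex(
            @"<tag(?'tag'\d) .*? (?=<a|<tag) (<a[^>]*> (?'text'[^<]*) </a> \s*)*",
            RegexOptions.Singleline
            | RegexOptions.IgnorePatternWhitespace
            | RegexOptions.ExplicitCapture);

        // Returns entries of the form "tagNumber:linkText", in input order.
        public static List<string> Extract(string input)
        {
            var result = new List<string>();
            foreach (Match m in Rex.Matches(input))
            {
                // The lookahead stops the lazy .*? at the first <a or <tag,
                // so the repeated group gets a chance to collect the links.
                foreach (Capture c in m.Groups["text"].Captures)
                    result.Add(m.Groups["tag"].Value + ":" + c.Value);
            }
            return result;
        }
    }
    ```

    Without the lookahead, the lazy .*? and the optional group could both settle for matching nothing, and no link text would be captured.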

    Is there an assertionless way?

    -Wayne

  • LookAhead LookBehind Subtleties

    Today I came across an example that nicely illustrates the use of a zero-width look-behind assertion. A zero-width assertion is an expression that may match text but does not consume any characters from the input string.

    I wanted to extract an entire <img> tag from a chunk of text, while also verifying that the tag was self-closed. A starting pattern might look like this:

    <img [^/]* />
    That will work except for the occasional img tag that includes an embedded forward-slash.

    One way to correct for that while keeping the pattern nice and simple is to use a zero-width assertion. Applying a look-ahead assertion would produce:

    <img [^>]* (?=/)>
    But that doesn't quite work. When applied to the sample text:
    <img src='foo' />
    the character class will consume characters up to and including the forward-slash. The assertion is then applied and 'looks ahead' for a forward-slash, but it finds only the greater-than sign, and so fails.

    The solution, then, is to look backward to find the forward-slash:

    <img [^>]* (?<=/)>

    Note: The regex patterns above include extra spaces in them to make them more readable; remove the spaces or apply the 'IgnorePatternWhitespace' option when using them.
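
    To compare the two assertions side by side, here is a small sketch in C#.  The names ImgTag, LookAhead, LookBehind, and Extract are hypothetical, and the readability spaces have been removed from the patterns except for the one after img:

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper; not part of any library.
    public static class ImgTag
    {
        // Look-ahead: requires the next character to be both '/' and '>',
        // which cannot happen, so this pattern never matches.
        public static readonly Regex LookAhead = new Regex(@"<img [^>]*(?=/)>");

        // Look-behind: requires the character just before '>' to be '/'.
        public static readonly Regex LookBehind = new Regex(@"<img [^>]*(?<=/)>");

        // Returns the matched tag, or null when there is no match.
        public static string Extract(Regex rex, string html)
        {
            Match m = rex.Match(html);
            return m.Success ? m.Value : null;
        }
    }
    ```

    Given the text <p><img src='a/b.png' /></p>, the look-behind version returns the whole <img ... /> tag despite the embedded forward-slash, while the look-ahead version returns null.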

    -Wayne


    Taking another look at this, I see I got carried away with using the 'cool' look-behind functionality. For the specific problem I was trying to solve — that is, extract an entire <img> tag from a chunk of text while also verifying that the tag is self-closed — use of the look-behind feature is overkill. This 'traditional' regex will do that:

    <img [^>]* />

  • Speaking of Pronunciation

    I imagine that most people agree that 'regex' is better than 'regexp' when choosing an abbreviated word for 'regular expression'.  It's just easier to articulate without that troublesome letter p.  However, it still leaves pronunciation of the letter g open to debate.

    I think the letter g should be soft, not hard.  Or, via a pronunciation guide, it should be /réjeks/, not /régeks/.

    The intent is that it should be easy to pronounce, right?  Well, to my palate, the soft g is easier and rolls off the tongue better than a hard g.  The hard g causes a halt that seems a bit akin to a stammer.

    -Wayne

  • A Split Approach

    Ahh, my first blog entry...

    The Regex.Split() method offers a clean way to break a delimited string of text into substrings.  Its advantage over the standard String.Split() method is that you can use a delimiter that is variable or not precisely known.  A simple example suited for regex is a comma-delimited list where the delimiter may or may not be followed by a space, as:

    string[] fields = Regex.Split(delimitedList, ", ?");

    There are times, however, when the Split method is not the best choice for splitting a string.  Consider a comma-delimited string where each field may be surrounded by quotes and, when quoted, may contain embedded commas.  Example:

    Mister King,"123rd St, Redmond",425-555-1234
    and the split should produce:
    [0] => Mister King
    [1] => "123rd St, Redmond"
    [2] => 425-555-1234

    To use the Regex.Split() method, you would need to figure out some way to match a comma, except when it falls between an opening quote and a closing quote. While it may be possible to devise such a pattern, it would be pretty tricky and certainly more complex than an alternative approach.  That alternative approach is to use the Regex.Matches() method with a pattern that explicitly matches each field instead of each delimiter.

    Regex rex = new Regex(
          //match quoted text if possible
          //otherwise match until a comma is found
          @"  ""[^""]*""  |  [^,]+  ",
          RegexOptions.IgnorePatternWhitespace);
    int i=0;
    foreach(Match m in rex.Matches(delimitedList))
       Console.WriteLine("[{0}] => {1}", i++, m.Value);

    If more than one delimiter is possible (e.g., commas or semi-colons), or if more than one kind of quote is possible (e.g., double-quotes or single-quotes), it is easy enough to enhance the pattern:

          @" ""[^""]*""  |  '[^']*'  |  [^,;]+ "

    But it gets a little tricky if you need to allow for empty fields, as in input like:

    Mister King,,425-555-1234

    A la,

    Regex rex = new Regex(
          @" (?: ,|^ )  ( ""[^""]*""  |  [^,]* )  ",
          RegexOptions.IgnorePatternWhitespace);
    int i=0;
    foreach(Match m in rex.Matches(delimitedList))
       Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);

    We now match both the delimiter and the field that immediately follows it.  To accomplish that, the first grouping in the pattern matches either a comma (the delimiter), or the beginning of input (the first field does not have a delimiter preceding it, so we have to match beginning of input).  When we extract the fields from the collection of matches, then, we just need to ignore the delimiters we've matched.  The code above does that by making the delimiter-match a non-capturing group ((?: )), and by acquiring the field text from the capturing group instead of the entire text of each match.


    The above pattern works for all cases except when the first field is empty.  I'll leave dealing with that as a future exercise.  Hint: match the entire input with a single match, and capture all the fields into a capturing group.
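
    One possible reading of that hint, offered as a sketch rather than the definitive answer: anchor a single match to the whole line and reuse one named group for every field, then read the fields from Group.Captures.  The names CsvFields, Split, and the group name field are hypothetical, and the sketch assumes one record per input string:

    ```csharp
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Hypothetical helper; not part of any library.
    public static class CsvFields
    {
        // First field, then zero or more (comma, field) pairs, up to end of input.
        // Both occurrences of 'field' feed the same capture collection.
        private static readonly Regex Rex = new Regex(
            @"^ (?'field' ""[^""]*"" | [^,]* )
                (?: , (?'field' ""[^""]*"" | [^,]* ) )* $",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

        // Returns the fields in order; quoted fields keep their quotes.
        public static List<string> Split(string line)
        {
            var fields = new List<string>();
            Match m = Rex.Match(line);
            if (m.Success)
            {
                foreach (Capture c in m.Groups["field"].Captures)
                    fields.Add(c.Value);
            }
            return fields;
        }
    }
    ```

    Because the first field is matched by [^,]* directly after ^, an empty first field is captured as an empty string instead of being skipped, so ",b,c" yields three fields.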

    The above pattern matches a field and the delimiter that immediately precedes it.  A similar approach is to match a field and the delimiter that immediately follows it.  Note the similarities of this pattern and code:

    Regex rex = new Regex(
          @" ( ""[^""]*""  |  [^,]* )  (?: ,|$ ) ",
          RegexOptions.IgnorePatternWhitespace);
    int i=0;
    foreach(Match m in rex.Matches(delimitedList))
       Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);

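    Note that this approach doesn't eliminate the empty-field issue; it moves it to the end of the input, where the pattern can also succeed with an empty match after the last field.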

    -Wayne