
Multiple tasks in a single regex (searching word list)

Last post 03-15-2012, 5:55 PM by Aussie Susan. 3 replies.
  •  03-14-2012, 11:40 AM 84776

    Multiple tasks in a single regex (searching word list)

    I am writing a regex to search a list of words for words with specific characteristics.

    A complex example of this might be something like: Any word that contains 8 letters, doesn't contain the letter S, and contains one or two vowels.

    I'm presently accomplishing this with two regexes and programming outside the regex engine, like so: (pseudocode)

     

    regex1 = "^[a-rt-z]{8}$"
    regex2 = "[aeiou]"

    for each word in wordList,
      if regex_match(regex1,word) is true,
        count = regex_match(regex2,word).match_count
        if count == 1 || count == 2
          print "word " + word + " matches."

     

    So by this rule I can easily find the words that are 8 characters long and do not contain "s" with a single regex, and the regex engine alone returns only valid words. However, the second step - the vowel search - seems possible but, as you can see, has to be accomplished with help from the hosting programming environment.
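    For reference, the two-step approach described above could be sketched in Python roughly as follows. Python's re module, the function name matches_spec and the sample word list are all assumptions made for illustration; the post itself only gives pseudocode.

```python
import re

# Step 1: exactly 8 lowercase letters, none of them "s".
SHAPE = re.compile(r"^[a-rt-z]{8}$")
# Step 2: a single vowel; occurrences are counted with findall below.
VOWEL = re.compile(r"[aeiou]")


def matches_spec(word):
    """Return True if word has 8 letters, no 's', and one or two vowels."""
    if not SHAPE.match(word):
        return False
    # The counting happens in the host language, not in the regex engine.
    return len(VOWEL.findall(word)) in (1, 2)


# Hypothetical word list, for illustration only.
word_list = ["rhythmic", "strength", "children", "triangle", "crypt"]
for word in word_list:
    if matches_spec(word):
        print("word " + word + " matches.")
```

    With this sample list only "rhythmic" and "children" are printed: "strength" contains an "s", "triangle" has three vowels and "crypt" is too short.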

    Can regex be used flexibly enough to accomplish this task using ONLY regex, such that the match list returned will be the words matching the specification?

    Thanks,

    F

     

  •  03-14-2012, 7:07 PM 84777 in reply to 84776

    Re: Multiple tasks in a single regex (searching word list)

    In general the answer is to use a lookahead because it, in effect, lets you check 2 separate things from the same starting position.

    Normally the regex engine maintains a pointer into the text that is incremented each time the "current" character (i.e. the one being pointed to) matches the current part of the regex pattern. The only way the pointer is moved backwards is via the backtracking mechanism but that only gets invoked when the pattern cannot be matched and the regex engine is searching for other possible ways of matching the text to the pattern.

    The lookahead mechanism can be thought of like a "subroutine" in that the text pointer is saved, the lookahead pattern is processed and, when it is complete, the previous text pointer is restored.
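    A small sketch of that "saved and restored pointer" behaviour, using Python's re module as a stand-in (an assumption; no regex variant has been named at this point in the thread). The sample text is made up for illustration.

```python
import re

text = "abcdef"

# The lookahead checks that "abc" comes next but consumes nothing,
# so the overall match is empty and ends where it started.
m = re.match(r"(?=abc)", text)
print(repr(m.group()), m.start(), m.end())

# Once the lookahead has succeeded, the rest of the pattern resumes
# from the saved position, so "ab" is matched starting at index 0.
m2 = re.match(r"(?=abc)ab", text)
print(m2.group(), m2.span())
```

    The first match has zero length, which is why a lookahead can be followed by a second, independent check over the same characters.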

    Therefore you could use a pattern such as:

    ^(?=[^aeiou]*[aeiou]([^aeiou]*[aeiou])?[^aeiou]*$)[a-rt-z]{8}$

    HOWEVER, you have not mentioned the regex variant you are using and, while the above pattern doesn't use anything too exotic, there is no guarantee that it will work with your regex variant, or it may require some syntax changes to work. Also, this assumes the same option settings (such as "ignore case") as you are using in your examples.

    The lookahead is a bit messy to allow for the 1 or 2 vowel situation. You could also write it as:

    ^(?=([^aeiou]*[aeiou]){1,2}[^aeiou]*$)[a-rt-z]{8}$

    but this effectively is the same thing.
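    As a quick check that the two forms behave alike, here is a sketch in Python's re module (an assumption, since the regex variant is not yet known here). The names LONG_FORM and SHORT_FORM and the sample strings are hypothetical; "rhythmly" is a made-up string with no vowels.

```python
import re

# The two patterns from the post, unchanged.
LONG_FORM = re.compile(
    r"^(?=[^aeiou]*[aeiou]([^aeiou]*[aeiou])?[^aeiou]*$)[a-rt-z]{8}$"
)
SHORT_FORM = re.compile(
    r"^(?=([^aeiou]*[aeiou]){1,2}[^aeiou]*$)[a-rt-z]{8}$"
)

# Hypothetical samples covering 0, 1, 2 and 3 vowels, an "s", and a short word.
samples = ["rhythmly", "rhythmic", "children", "triangle", "strength", "crypt"]

long_hits = [w for w in samples if LONG_FORM.match(w)]
short_hits = [w for w in samples if SHORT_FORM.match(w)]

print(long_hits)
print(short_hits)
```

    Both lists come out as "rhythmic" and "children" for these samples, so the choice between the two forms is mostly a matter of readability.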

    Susan

  •  03-15-2012, 9:35 AM 84782 in reply to 84777

    Re: Multiple tasks in a single regex (searching word list)

    Lookahead seems to be exactly what I need for this particular task. By the way I'm using PCRE.

    I have another situation similar to the above that I'm thinking MIGHT be accomplish-able by lookahead, but the expression may start to get unwieldy.

    Suppose I have the same situation - in this example, a word must be 10 characters long - AND it must contain EXACTLY ONE of a certain letter, and EXACTLY TWO of another.

    I'm thinking I could do a lookahead similar to (?=[^x]*x[^x]*$) (assuming the first letter is X) and repeat that same lookahead for the second letter, adding () and {2} appropriately?

    So say: ^(?=[^x]*x[^x]*)(?=([^e]*e){2}[^e]*)[a-z]{10}$ 

    Thanks for the help!

  •  03-15-2012, 5:55 PM 84784 in reply to 84782

    Re: Multiple tasks in a single regex (searching word list)

    Got it in one! However, be careful of '(?=[^x]*x[^x]*)' without adding in the '$' at the end. This will actually match the "abxyz" portion of "abxyzxpqrxfgh"; in other words it will stop at the 2nd "x" and declare a match.
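    The difference can be seen in a short sketch, written here with Python's re module as an assumption (the same lookahead syntax is used by PCRE). The names loose, strict, LOOSE_FULL and STRICT_FULL and the test word are hypothetical.

```python
import re

text = "abxyzxpqrxfgh"

# Without "$" the lookahead is satisfied by a prefix of the text.
loose = re.compile(r"^(?=[^x]*x[^x]*)")
# With "$" the lookahead has to reach the end, so a 2nd "x" makes it fail.
strict = re.compile(r"^(?=[^x]*x[^x]*$)")

print(bool(loose.match(text)))                   # True
print(bool(strict.match(text)))                  # False
print(re.match(r"[^x]*x[^x]*", text).group())    # abxyz

# The same effect on the full 10-letter pattern: "excitement" has one "x"
# but three "e", so only the version with "$" rejects it.
LOOSE_FULL = re.compile(r"^(?=[^x]*x[^x]*)(?=([^e]*e){2}[^e]*)[a-z]{10}$")
STRICT_FULL = re.compile(r"^(?=[^x]*x[^x]*$)(?=([^e]*e){2}[^e]*$)[a-z]{10}$")

print(bool(LOOSE_FULL.match("excitement")))      # True
print(bool(STRICT_FULL.match("excitement")))     # False
```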

    Yes, the expressions can become rather ugly but the key thing is that they work. Life is a little easier if your criterion is "at least n" rather than "exactly n" occurrences because you can then use patterns such as '(?=.*?x)'.
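    A minimal sketch of the "at least n" style, again assuming Python's re module. The pattern name AT_LEAST and the sample words are hypothetical, and the '(?:.*?e){2}' form for "at least two" is one possible extension of '(?=.*?x)', not something stated in the thread.

```python
import re

# At least one "x" and at least two "e" in a 10-letter lowercase word.
# No "$" is needed inside the lookaheads because extra occurrences are allowed.
AT_LEAST = re.compile(r"^(?=.*?x)(?=(?:.*?e){2})[a-z]{10}$")

# Hypothetical samples: 1 x / 2 e, 1 x / 3 e, 1 x / 1 e.
for word in ["expression", "excitement", "explaining"]:
    print(word, bool(AT_LEAST.match(word)))
```

    Here "expression" and "excitement" both pass, while "explaining" fails because it has only one "e".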

    Also, this is the reason some people like to use the "x" option which allows them to split the pattern over several lines in their source text so that they can see the various parts more clearly - this aids development and maintenance. However, as long as you keep everything in sync, you can achieve pretty much the same thing by adding good comments to your code.
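    As an illustration of the "x" option, here is the exactly-one-x, exactly-two-e pattern laid out over several lines. This sketch assumes Python's re module, where the equivalent flag is re.VERBOSE; the name PATTERN and the sample words are hypothetical.

```python
import re

# In verbose mode, whitespace outside character classes is ignored
# and "#" starts a comment that runs to the end of the line.
PATTERN = re.compile(
    r"""
    ^
    (?= [^x]* x [^x]* $ )            # exactly one x
    (?= (?: [^e]* e ){2} [^e]* $ )   # exactly two e
    [a-z]{10}                        # ten lowercase letters
    $
    """,
    re.VERBOSE,
)

# Hypothetical samples: 1 x / 2 e, 1 x / 3 e, 1 x / 1 e.
for word in ["expression", "excitement", "explaining"]:
    print(word, bool(PATTERN.match(word)))
```

    Only "expression" matches here; the other two words have the wrong number of "e".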

    Susan
