Match multiple words

Last post 03-30-2012, 12:47 AM by gman. 4 replies.
  •  03-26-2012, 9:39 AM 84839

    Match multiple words

    I would like a single JavaScript regular expression that is capable of matching these conditions: the word "CUSTODIAN" preceded by either "AS" or "THE" and followed by the words "OF", "FOR" and "THE". Note that all of the preceding and trailing words are optional, with "CUSTODIAN" being the only required word. For example, any of the following would match:

    JOHN DOE AS CUSTODIAN OF THE ...

    JOHN DOE AS THE CUSTODIAN FOR ...

    JOHN DOE THE CUSTODIAN OF ...

    Here is the regex that I have come up with which works to some extent but gives extra matches:

    \W*(AS)*\W*(THE)*\W*CUSTODIAN\W*(FOR)*\W*(OF)*\W*(THE)*\W*

     Example:

    inputname = "JOHN DOE AS THE CUSTODIAN OF THE MARY DOE TRUST"  returns a match of  ==> "AS THE CUSTODIAN OF THE ,AS,THE,,OF,THE" 

    I'm fairly new to RegEx so any help would be greatly appreciated.

    Gerald Duncan

  •  03-26-2012, 6:28 PM 84842 in reply to 84839

    Re: Match multiple words

    There are several things going on here and it might be easiest to separate them out.

    For a start, I suspect the matching process is actually returning an array of captures (which is typically what you want, especially given that the pattern you are using specifies multiple capture groups). The normal matching process returns the complete matching string as capture group 0 (or element 0 in your array) - hence the "AS THE CUSTODIAN OF THE". For each set of parentheses in the pattern, the regex engine will create a capture group and number them starting with 1 and increasing by one each time. Therefore, you have defined capture group #1 as "AS", #2 as "THE", #3 as "FOR", #4 as "OF" and #5 as "THE".

    In your test string, there is no "FOR" character sequence and that is why nothing appears between the two commas for capture group #3 in the output string. (In JavaScript, a group that takes no part in the match is returned as undefined, which prints as an empty string when the array is converted to text.)

    Therefore, if all you want is the complete matched string, then just use the first ( 'zeroth') element of the returned array.
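
    As a rough illustration of the point above, here is a minimal sketch that assumes the match is obtained with the exec() method of a RegExp (the question does not say which method was used). The variable names "original", "result", "joined" and "whole" are made up for the example.

```javascript
// The pattern from the original question, unchanged.
const original = /\W*(AS)*\W*(THE)*\W*CUSTODIAN\W*(FOR)*\W*(OF)*\W*(THE)*\W*/;
const inputname = "JOHN DOE AS THE CUSTODIAN OF THE MARY DOE TRUST";

const result = original.exec(inputname);

// Converting the whole array to text joins every element with commas,
// which reproduces the "extra matches" described in the question.
const joined = String(result);

// Element 0 is the complete match, including the spaces picked up by '\W*'.
const whole = result[0];

console.log(JSON.stringify(joined));
console.log(JSON.stringify(whole));
```

    Printing through JSON.stringify is only there to make the leading and trailing spaces visible.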

    If you look carefully at the string returned in the first element of the array, you will see that it is actually followed by a trailing space. This is because you have a '\W*' right at the end of the pattern. (There is also a space at the beginning of the actual string - you may not have seen it and may not have copied it in your posting - caused in exactly the same way by the '\W*' at the start of your pattern.) The actual definition of the '*' quantifier is to "match zero or more occurrences of the preceding item, matching as many times as possible". The '\W' is the preceding item and will match any non-word character (anything other than a letter, digit or underscore), such as the space character. This works quite well to account for the spaces between the optional words but will also pick up any other non-word characters (such as line feeds, punctuation etc.) before and after the key words.

    In this case it might be better to use the zero-width assertion '\b' at the beginning and end of the pattern. This will make sure that "AS" matches just the word "as" and not the last part of (for example) "has", but will not include any spurious characters at the start or end.

    One advantage of keeping the '\W*' to match between the optional words is that this will match the line-feed, carriage return and other characters that are used to split the required phrase over multiple lines.

    If you want to match an optional word, then you may be better off using the '?' quantifier rather than '*'. For example '\W*(AS)*\W*' will match the whole of "ASASASASAS", but '\W*(AS)?\W*' will match at most a single instance of "AS" and so will stop after the first one.
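
    The difference between the two quantifiers, and the effect of '\b', can be checked with a short sketch like the following. The input strings and variable names are hypothetical and chosen only to show the behaviour.

```javascript
// Hypothetical input: the optional word repeated several times.
const repeated = "ASASASASAS";

// '*' allows any number of repetitions, so the whole run is matched.
const star = /(AS)*/.exec(repeated)[0];

// '?' allows at most one occurrence, so the match stops after the first "AS".
const optional = /(AS)?/.exec(repeated)[0];

// '\b' keeps "AS" from matching the tail of a longer word such as "HAS".
const insideWord = /\bAS\b/.test("HAS");
const wholeWord = /\bAS\b/.test("SUCH AS");

console.log(star, optional, insideWord, wholeWord);
```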

    So far I would recommend that you use the pattern:

    \b(AS)?\W*(THE)?\W*CUSTODIAN\W*(FOR)?\W*(OF)?\W*(THE)?\b

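    Now if we consider the (rather unlikely) test string that you have provided of "JOHN DOE AS THE CUSTODIAN FOR ..." (you probably would never have the ellipses but it is useful for illustrating a point) and assume that some further text follows the ellipses, you get a match of

    AS THE CUSTODIAN FOR ...

    In this case you will see that the '\W*' after the '(FOR)?' is picking up not only the whitespace but also the ellipses (and any other non-word characters, as we talked about before), because the final '\b' is satisfied at the start of the next word. If the ellipses are the very end of the string, the final '\b' forces the match to back up so that it ends at "FOR". So we have got rid of the trailing characters if the last matched word is "THE", or if nothing but non-word characters follow the phrase, but not otherwise.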
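
    A quick way to see where the recommended pattern stops is to try it on a few inputs. This is only a sketch: the text "MARY DOE" placed after the ellipses is a hypothetical addition, and the variable names are invented for the example.

```javascript
const recommended = /\b(AS)?\W*(THE)?\W*CUSTODIAN\W*(FOR)?\W*(OF)?\W*(THE)?\b/;

// Hypothetical text after the ellipses: '\W*' runs on up to the next word.
const withText = recommended.exec("JOHN DOE AS THE CUSTODIAN FOR ... MARY DOE")[0];

// Ellipses at the very end: the final '\b' makes the match back up to "FOR".
const atEnd = recommended.exec("JOHN DOE AS THE CUSTODIAN FOR ...")[0];

// Last optional word is "THE": the match stops right after it.
const endsInThe = recommended.exec("JOHN DOE AS CUSTODIAN OF THE MARY DOE TRUST")[0];

console.log(JSON.stringify(withText));
console.log(JSON.stringify(atEnd));
console.log(JSON.stringify(endsInThe));
```

    If the stray punctuation matters, trimming non-word characters from the end of the result in a second step is one possible way to deal with it.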

    If you can do without having the optional words in a specific order, then you could use a pattern such as:

    (\b(as|the)\s+)*custodian(\s+(of|for|the)\b)*

    This will effectively apply the '\b' before whatever optional word comes at the start and after whatever optional words come at the end. It also checks only for whitespace characters between words. (The pattern is written in lower case, so use the case-insensitive 'i' flag or change the words to upper case to match your data.) It does have the disadvantage that it will match something stupid such as:

    the as as the custodian of for the of the for
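
    A small sketch of the any-order pattern in use. The 'i' flag is an addition made here so that the lower-case pattern also matches upper-case input, and the variable names are illustrative.

```javascript
// Same pattern as above, with the case-insensitive flag added.
const anyOrder = /(\b(as|the)\s+)*custodian(\s+(of|for|the)\b)*/i;

// A sensible input gives just the phrase, with no stray spaces.
const normal = anyOrder.exec("JOHN DOE AS THE CUSTODIAN OF THE MARY DOE TRUST")[0];

// The nonsense input is accepted in full, which is the drawback described above.
const sillyInput = "the as as the custodian of for the of the for";
const silly = anyOrder.exec(sillyInput)[0];

console.log(JSON.stringify(normal));
console.log(silly === sillyInput);
```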

    Susan

  •  03-28-2012, 12:32 PM 84846 in reply to 84842

    Re: Match multiple words

    Awesome reply! You answered so many of my questions with one post! I like your recommendation of using the zero array element on this expression. 

    \b(AS)?\W*(THE)?\W*CUSTODIAN\W*(FOR)?\W*(OF)?\W*(THE)?\b

    My only question is how do you reference and use the zero element. I would like to save the 0 match to a var named regvar. How would that look?

    I really appreciated your comprehensive and informative reply. Thanks much.

    Gerald 

     

     

  •  03-28-2012, 6:43 PM 84848 in reply to 84846

    Re: Match multiple words

    I must admit that I really have no experience in using JavaScript - my knowledge really only extends as far as knowing about the patterns it can use. Therefore what follows is mainly from performing Google searches.

    I assume that you are using the ".exec()" method of the RegExp object to perform your matching as this is the one that seems to create the output that you have described.

    The ".exec()" method returns an object that can be treated as an array. (It also has properties that describe other aspects of the capture such as ".index" which returns the offset where the match started in the string). Therefore all you need to do is to reference the "[0]"th element as in (completely untested and guessed at syntax!):

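    var RegExpression = /my pattern here/;
    var LastCapture = RegExpression.exec("My Test String Here");
    // exec() returns null when there is no match, so check before indexing
    var regvar = LastCapture ? LastCapture[0] : null;
    console.log(regvar);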

    (I'm sure that some JavaScript expert will correct me if I have got something wrong)

    Of course, you can also retrieve the text from any other capture group by specifying the appropriate index. If you reference the index of a capture group that did match some text, then you will get that text; on the other hand, if you use the index of a capture group that didn't take part in the match, then in JavaScript you will get undefined (which shows up as an empty string when the whole array is printed).
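
    Putting that together with the pattern recommended earlier in the thread, a minimal sketch might look like this. The name "regvar" comes from the question; the other variable names are made up for the example.

```javascript
const pattern = /\b(AS)?\W*(THE)?\W*CUSTODIAN\W*(FOR)?\W*(OF)?\W*(THE)?\b/;
const captures = pattern.exec("JOHN DOE AS THE CUSTODIAN OF THE MARY DOE TRUST");

const regvar = captures[0]; // the complete match
const first = captures[1];  // text captured by (AS)?
const third = captures[3];  // (FOR)? took no part in this match

// A group that did not participate is undefined, so supply a default
// if an empty string is more convenient to work with.
const thirdOrEmpty = third === undefined ? "" : third;

console.log(regvar, first, JSON.stringify(thirdOrEmpty));
```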

    There is one thing you should know about the text captured in repeated capture groups and that is that you only get the text from the last capture. For example, if you had the pattern

    /(\w)+/

    and the text

    "hello"

    then the '\w' will actually match first the "h", then the "e" and so on until it gets to the "o". However, the general way regex engines handle this is to overwrite any previously captured text, so the only character you will get returned is the "o". (This is often seen as a limitation of regex engines and, as far as I am aware, only the .NET regex engine lets you access the individual "captures".) A way around this is to capture all of the characters in one pass and then perform a second pass on the captured text.
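
    The overwriting behaviour and the two-pass workaround can be sketched as follows. The variable names are illustrative, and using match() with the global 'g' flag for the second pass is just one possible approach.

```javascript
// The group is repeated, so it only remembers its final iteration.
const lastOnly = /(\w)+/.exec("hello");
const wholeWord = lastOnly[0];  // the complete match
const lastLetter = lastOnly[1]; // only the last character captured

// Workaround: capture the whole run in one pass...
const run = /(\w+)/.exec("hello")[1];

// ...then split it into individual characters in a second pass.
const letters = run.match(/\w/g);

console.log(wholeWord, lastLetter, letters);
```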

    I know that you are not facing this situation, but my experience is that, once you start using regex patterns, you will see other places where you want to use a regex as well and things soon escalate.

    Susan

  •  03-30-2012, 12:47 AM 84860 in reply to 84848

    Re: Match multiple words

    You really are great! The 0 reference was the missing piece. Thank you VERY much for all your help. You have definitely been a tremendous asset.

    Gerald
