
Quick help with construction.

Last post 09-27-2012, 11:50 AM by Persican. 2 replies.
  •  09-26-2012, 4:06 PM 86676

    Quick help with construction.

    Hi everyone.

     

    I'm trying to capture some group from this string:

     "VA90-258    37.2997   -75.8333      13.3   VIRGINIA COAST"

    For instance, I am using this regex
    '\\S+\\d+\\-\\d+\\W+\\d+.\\d+\\W+\\d+.\\d+\\W+\\d+.\\d+\\W+\\S*'

    which is certainly not elegant. Additionally, I fail to capture the last string, VIRGINIA COAST. Once my regex is correct, I'll place () to get 5 groups:

    VA90-258   

    37.2997  

    -75.8333     

    13.3

    VIRGINIA COAST

     

    Any help would be appreciated.

     

    With regards,

    Phil

     

     

     

  •  09-26-2012, 8:15 PM 86677 in reply to 86676

    Re: Quick help with construction.

    You don't mention the regex variant that you are using, but I would be surprised if it has a different interpretation of '\W' and '\S' from the "usual" ones of "non-alphanumeric" and "non-whitespace" respectively.

    Going by your single example, I would suggest the following:

    ^            - force the regex match to start at the beginning of the line
    \w+-\d+  - match one or more alphanumerics followed by a literal hyphen followed by one or more digits
    \s+        - skip the whitespace
    [\d.]+     - match the first numeric value (see below)
    \s+        - skip the whitespace
    [\d-.]+    - match the 2nd numeric value, allowing for a negative
    \s+        - skip the whitespace
    [\d.]+     - 3rd numeric value
    \s+        - see the pattern emerging
    .*           - match any character, zero or more times
    $            -  force a match with the end of the line/string
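    The same breakdown can be kept inside the pattern itself if your regex variant has a "verbose" or "ignore pattern whitespace" option. Here is a rough sketch in Python (my assumption, since the variant isn't stated); the name FIELDS is made up for illustration, and the hyphen is moved to the end of the character class because Python's re module rejects '[\d-.]' as a bad range.

```python
import re

# Line-by-line breakdown written as a verbose pattern.
# FIELDS is a hypothetical name used only for this sketch.
FIELDS = re.compile(
    r"""
    ^           # start of the line
    \w+-\d+     # alphanumerics, a literal hyphen, then digits
    \s+         # skip the whitespace
    [\d.]+      # first numeric value
    \s+         # skip the whitespace
    [\d.-]+     # second numeric value, allowing for a negative sign
    \s+         # skip the whitespace
    [\d.]+      # third numeric value
    \s+         # skip the whitespace
    .*          # any character, zero or more times
    $           # end of the line/string
    """,
    re.VERBOSE,
)

line = "VA90-258    37.2997   -75.8333      13.3   VIRGINIA COAST"
matched = FIELDS.match(line) is not None
print(matched)
```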

    There are alternatives available at nearly every step, and the approach I have taken does the matching but doesn't try to validate the detailed structure of each item. For example, my suggestion to match a numeric value - '[\d.]+' or '[\d-.]+' - will match any sequence of digits, hyphens and decimal points, including "3--..32.2-0". If you REQUIRE the number to be formatted as digits, decimal point and more digits then what I have suggested can be replaced by '\d+\.\d+'.

    Similarly, if the negative number must be formatted correctly, then the pattern might be '-?\d+\.\d+' - the '-?' allows for the negative sign being optional.
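    To make the difference concrete, here is a small Python sketch (Python is my assumption; the names loose and strict are made up for illustration) that runs the permissive class and the stricter pattern over the same samples. The class is written as '[\d.-]' because Python's re module does not accept '[\d-.]'.

```python
import re

loose = re.compile(r"[\d.-]+")      # any run of digits, dots and hyphens
strict = re.compile(r"-?\d+\.\d+")  # optional sign, digits, literal dot, digits

samples = ["37.2997", "-75.8333", "3--..32.2-0"]

# Map each sample to (accepted by loose, accepted by strict)
results = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s))) for s in samples
}
for sample, verdict in results.items():
    print(sample, verdict)
```

    Only the strict pattern turns away the malformed "3--..32.2-0".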

    Note that I put a '\' before the '.'. By itself the '.' is "match any character, with the possible exception of NewLine" and so I have quoted it to force the interpretation of a literal period character.
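    A quick way to see why the quoting matters, sketched in Python (an assumption on my part; the test strings are made up):

```python
import re

unescaped = re.compile(r"\d+.\d+")   # '.' matches any character here
escaped = re.compile(r"\d+\.\d+")    # '\.' matches only a literal period

# "37x2997" is not a decimal number, but the unescaped form accepts it
unescaped_accepts_x = bool(unescaped.fullmatch("37x2997"))
escaped_accepts_x = bool(escaped.fullmatch("37x2997"))
escaped_accepts_dot = bool(escaped.fullmatch("37.2997"))
print(unescaped_accepts_x, escaped_accepts_x, escaped_accepts_dot)
```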

    If you look at your pattern, you have quoted the '-' character. While this is not wrong, it is also unnecessary as the '-' character does not have any special meaning (except within a character set definition).

    Also, the above line-by-line breakdown should show you where the groupings should go. All together the pattern should look like

    ^(\w+-\d+)\s+([\d.]+)\s+([\d-.]+)\s+([\d.]+)\s+(.*)$

    You will need to set the "multiline" option if this is to be applied to each line of a block of text.
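    As an illustration only, here is how the grouped pattern and the multiline option might be used in Python (my assumption about the variant). The second line of the sample text is invented, PATTERN and rows are hypothetical names, and the hyphen sits at the end of the class because Python's re module rejects '[\d-.]'.

```python
import re

# Grouped pattern from the post, with the multiline option so that
# '^' and '$' anchor at every line rather than only at the whole string.
PATTERN = re.compile(
    r"^(\w+-\d+)\s+([\d.]+)\s+([\d.-]+)\s+([\d.]+)\s+(.*)$",
    re.MULTILINE,
)

# First line is the example from the question; the second is made up.
text = (
    "VA90-258    37.2997   -75.8333      13.3   VIRGINIA COAST\n"
    "NC12-004    35.1000   -76.2500       8.0   CAROLINA SHORE"
)

# One tuple of five captured groups per matching line
rows = [m.groups() for m in PATTERN.finditer(text)]
for row in rows:
    print(row)
```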

    You can skip what follows unless you want to know why your pattern doesn't work quite how you expect it to:

    The reason you don't get "VIRGINIA COAST" is that you are trying to match this with the '\S*' part of your pattern. '\s' will match any whitespace character and '\S' will match any non-whitespace character. Therefore the '\S*' will match all of the characters in "VIRGINIA" but will stop matching when it gets to the space before "COAST".
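    A minimal Python sketch of that behaviour (Python being my assumption; the variable names are made up):

```python
import re

tail = "VIRGINIA COAST"

# '\S*' stops at the first whitespace character
non_space = re.match(r"\S*", tail).group()

# '.*' keeps going to the end of the line
rest_of_line = re.match(r".*", tail).group()

print(repr(non_space), repr(rest_of_line))
```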

    Also, you have '\S+\d+-\d+' to match the start of the string. Aside from the fact that the match could actually start anywhere within the text (hence my adding in the '^'), this will match one or more non-whitespace characters with the '\S+'. Therefore it will start by matching "VA90-258" and stop on the following space. The next part of the pattern is '\d+' but this can't match the whitespace, and so the regex engine will start to backtrack by releasing one character at a time until the '\d+' can succeed.

    In this case it will back off the "8" and let the '\d+' match it. However, it then tries to match the '-' in the pattern with the trailing space and again fails. Therefore it backtracks again, releasing the "8" that the '\d+' matched and the "5" that the '\S+' matched. Now the '\d+' can match both digits but the '-' still can't match, and so the "8" is again released from the '\d+' and the '-' again tries to match the next character.

    This backtrack-and-try-again process will continue until the '-' finally matches the "-" in the text. (Actually it is more complex than that, as the '\S+' can match both of the "90" digits but the first '\d+' requires at least 1 digit.) Therefore the pattern elements will match the characters as follows:

    \S+    - "VA9"
    \d+    - "0"
    -        - "-"
    \d+    - "258"

    Therefore this part of your pattern will cause the regex engine to do a lot more work than necessary to perform this part of the match.
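    You can check this split for yourself by putting each element in its own group. A rough Python sketch (Python is my assumption; the variable names are made up), with the '\w+' version alongside for comparison:

```python
import re

start = "VA90-258    37.2997"

# Each element of the original prefix in its own group
greedy = re.match(r"(\S+)(\d+)(-)(\d+)", start)
greedy_parts = greedy.groups()

# '\w' cannot consume the hyphen, so no characters have to be given back
direct = re.match(r"(\w+)(-)(\d+)", start)
direct_parts = direct.groups()

print(greedy_parts)
print(direct_parts)
```

    Both forms match the same overall text, "VA90-258"; the difference is only in how much backing off the engine has to do to get there.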

    In this case, you probably won't notice the additional processing time, but this type of backtracking can lead to minutes or hours of processing (and what amount to infinite loops in pathological cases) on even moderate-length strings. If it can be avoided then it should be, and generally it can be if you have a clear understanding of what the valid characters can be at each step and construct your pattern accordingly.

    My suggestion above shows one way - it relies on the fact that the first characters (up to the "-") are alphanumeric and therefore the '\w' operator can match them all. When it gets to the "-" character it can continue the match without backtracking. You could also use something like

    [a-z]{2}\d{2}-\d{3}

    (with the "ignore case" option set) to match 2 alphabetic characters followed by 2 digits, the hyphen and 3 more digits. There are lots of other variations depending on what a valid string looks like.
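    For example, in Python (my assumption about the variant; the name station_id and the test strings other than "VA90-258" are made up):

```python
import re

# 2 letters, 2 digits, a hyphen and 3 digits, ignoring case
station_id = re.compile(r"[a-z]{2}\d{2}-\d{3}", re.IGNORECASE)

candidates = ["VA90-258", "va90-258", "VA9-258", "V490-258"]

# Map each candidate to whether the whole string fits the fixed layout
checks = {c: bool(station_id.fullmatch(c)) for c in candidates}
for candidate, ok in checks.items():
    print(candidate, ok)
```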

    Susan

  •  09-27-2012, 11:50 AM 86681 in reply to 86677

    Re: Quick help with construction.

    Thank you very much for the explanations. You provided great information here and detailed answers.

     

    Regards,

    Phil
