
Match address street house number with pre-direction

Last post 05-02-2012, 7:48 PM by Aussie Susan. 1 replies.
  •  05-01-2012, 10:45 PM 85065

    Match address street house number with pre-direction

    Hello,

    I am attempting to match street house numbers, including the pre-direction component. A house number may be expressed as a range separated by a dash, or may carry a house alpha designation. The trick, however, is to distinguish pre-directions from abbreviations embedded in the street name. To illustrate, please see the examples below, where the characters in bold are the desired match.

    110-A N Oak ST

    1001 S.E. Elm DR

    13-15 S. Birch LN NE

    1515 Set N Sun ST

    This regex gets part of the way there but fails on the first example.

    ^\d*\W*\d*\W*(N|S)?\W*(E|W)?\W*(S)?\W*(W)?\W*\b

    Any help would be really appreciated.

    Gerald

  •  05-02-2012, 7:48 PM 85068 in reply to 85065

    Re: Match address street house number with pre-direction

    Starting with why your pattern does not correctly match your first example, let's expand the pattern piece by piece:

    ^              - start at the beginning of the text (or line - you don't mention anything about using the multiline option)
    \d*\W*\d*      - match any leading digits, possibly followed by non-alphanumeric characters and more digits (#1)
    \W*(N|S)?      - match any non-alphanumeric characters followed by the literal character "N" or "S" or nothing
    \W*(E|W)?      - ditto but the literal character "E" or "W"
    \W*(S)?        - ditto but only the literal character "S"
    \W*(W)?        - ditto but only the literal character "W"
    \W*            - match any non-alphanumeric characters
    \b             - match the start or end boundary of a word (#2)

    The direct answer to your question about the first example lies on the line marked #1. This will match the leading digits ("110" in this example) plus any following non-alphanumeric characters (the "-" here) and then try to match more digits (as in your 3rd example). As the next character is "A", the regex engine takes advantage of the fact that the quantifier allows a successful match to occur with no digits being matched.

    If you look at every remaining line in the pattern, each one will only match the letters "N", "E", "S" and "W" in some combination, and each of those letters is optional. As the next letter is "A", the engine skips down all of the remaining lines until it gets to the last. The '\b' is a zero-width assertion that effectively checks for either a non-alphanumeric character to the left and an alphanumeric character to the right (i.e. the start of a word) or the other way around. In this case we have already matched the "-" (which is the non-alphanumeric on the left) and the next character is the "A" (which is the alphanumeric on the right), therefore the '\b' matches. As the pattern completes at this point, the whole match (of the "110-" characters) is declared.
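
    The behaviour described above can be reproduced with a short script. This is only a sketch using Python's re module (the thread does not say which regex engine is in use, so that choice is an assumption), and the helper name matched_text is made up for illustration.

```python
import re

# The pattern from the original post, unchanged.
ORIGINAL = re.compile(r"^\d*\W*\d*\W*(N|S)?\W*(E|W)?\W*(S)?\W*(W)?\W*\b")

EXAMPLES = [
    "110-A N Oak ST",
    "1001 S.E. Elm DR",
    "13-15 S. Birch LN NE",
    "1515 Set N Sun ST",
]

def matched_text(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = pattern.match(text)
    return None if m is None else m.group(0)

for address in EXAMPLES:
    # repr() makes trailing spaces and empty matches visible.
    print(f"{address!r:28} -> {matched_text(ORIGINAL, address)!r}")
```

    For the first address the reported match is "110-": the engine stops at the word boundary just before the "A".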

    The next question is how to fix this. As I don't really understand the type of address that you are trying to match (I once worked for our local postal service as an analyst and we identified over 40 different types of address just in our little country), I'm not sure exactly how the first part of the address line should be structured. However, I do have a few comments about your pattern that might help.

    The first is that it will match a line such as "Wombats"! The match is actually the empty string between the start of the line and the "W", but it is a successful match. This is because everything between the "^" and the "\b" is optional. I would recommend that you start the pattern with '^\d+', which requires that there be at least 1 digit right up front.
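
    A quick way to see the empty match, again sketched with Python's re module as an assumed engine (the variable names are illustrative only):

```python
import re

ORIGINAL = re.compile(r"^\d*\W*\d*\W*(N|S)?\W*(E|W)?\W*(S)?\W*(W)?\W*\b")

# Every element between ^ and \b is optional, so the pattern can succeed
# without consuming anything.
m = ORIGINAL.match("Wombats")
is_empty_match = m is not None and m.group(0) == ""
print("empty match on 'Wombats':", is_empty_match)

# Requiring at least one leading digit rejects the same input outright.
requires_digit = re.compile(r"^\d+")
print("with ^\\d+:", requires_digit.match("Wombats"))
```

    A successful but empty match is easy to overlook if the calling code only checks whether a match was found, so it is worth checking the matched text as well.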

    Next I'd suggest that you separate the types of street address values from the "N/S/E/W" options. From your examples it looks like you can follow the leading digits with a hyphen and other digits or letters. Therefore something like '^\d+(-\w+)?' might be an appropriate first part of the pattern. At least this matches the initial part of all of your examples (although I have not tested too many negative match examples - something I recommend you do).
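
    As a rough check of this first part, here is a sketch in Python (an assumed engine; house_number is a hypothetical helper). The last two inputs are made-up negative examples, not taken from the original post.

```python
import re

NUMBER_PART = re.compile(r"^\d+(-\w+)?")

def house_number(text):
    """Return the leading house-number part of the address, or None."""
    m = NUMBER_PART.match(text)
    return None if m is None else m.group(0)

for address in [
    "110-A N Oak ST",
    "1001 S.E. Elm DR",
    "13-15 S. Birch LN NE",
    "1515 Set N Sun ST",
    "Wombats",        # made-up input: no leading digits, so no match
    "12-Main ST",     # made-up input: \w+ also accepts a whole word
]:
    print(f"{address!r:26} -> {house_number(address)!r}")
```

    The "12-Main ST" case shows why negative testing matters: '\w+' accepts any run of letters, digits or underscores after the hyphen, so "12-Main" is taken as the house number. Whether that is acceptable depends on the address data.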

    I'm guessing that what follows can be some combination of the "standard" major points of the compass such as "N", "NE", "E", "SE", "S", "SW", "W" and "NW" where the "N" or "S" comes first if there are 2 letters mentioned. Something like

    (N|S)?(N|S|E|W)

    would match this (although it will also allow combinations like "NN" and "NS" - you can extend the pattern in various ways if you really need this type of validation but I'm trying to keep things simple for now). Taking this further, you could also simply use

    (N|S|E|W)

    or 

    [NSEW]

    if you can accept "WN" and other combinations. We also need to account for other characters that can come between the compass point letters - in your examples you have "." characters but I'm also going to throw in whitespace as a demonstration of how to extend this:

    ([NSEW][\s.]?)

    To allow for the fact that there could be 0, 1 or 2 letters, we add a quantifier, as in

    ([NSEW][\s.]?){0,2}
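
    To compare how strict the direction sub-patterns above are, the following sketch runs them against a few direction strings on their own. It assumes Python's re module and uses fullmatch so that the whole string must be a direction; the names STRICT, LOOSE and accepts are illustrative only.

```python
import re

# Optional N/S prefix followed by one compass letter.
STRICT = re.compile(r"(N|S)?(N|S|E|W)")
# Up to two compass letters, each optionally followed by "." or whitespace.
LOOSE = re.compile(r"([NSEW][\s.]?){0,2}")

def accepts(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

for direction in ["N", "SE", "NN", "NS", "WN", "S.E.", "S."]:
    print(f"{direction!r:7} strict={accepts(STRICT, direction)!s:5} "
          f"loose={accepts(LOOSE, direction)}")
```

    The strict form accepts "NN" and "NS" but rejects "WN" and anything containing dots; the loose form accepts "WN" as well as the dotted spellings.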

    Now, this sub-pattern will have a problem with your last example, as it will take the "S" of "Set" as a direction (or the "Se", as if it were "S.E.", when matching is case-insensitive). You could address this by appending '\b' but this would break the other examples. The other option is to stop the '[\s.]' from being optional as in

    ([NSEW][\s.]){0,2}
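
    The difference between the optional and the required separator can be seen on the tail of the last example. This is a sketch assuming Python's re module; the names OPTIONAL_SEP, REQUIRED_SEP and leading_direction are made up for the demonstration.

```python
import re

OPTIONAL_SEP = re.compile(r"([NSEW][\s.]?){0,2}")
REQUIRED_SEP = re.compile(r"([NSEW][\s.]){0,2}")

def leading_direction(pattern, text):
    """Return whatever the direction sub-pattern consumes at the start of text."""
    return pattern.match(text).group(0)

tail = "Set N Sun ST"   # what follows "1515 " in the last example
print(repr(leading_direction(OPTIONAL_SEP, tail)))  # consumes the "S" of "Set"
print(repr(leading_direction(REQUIRED_SEP, tail)))  # consumes nothing

# A genuine pre-direction is still picked up when the separator is required.
print(repr(leading_direction(REQUIRED_SEP, "S.E. Elm DR")))
```

    Because the quantifier allows zero repetitions, both patterns always succeed; what changes is how much text they consume.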

    Putting the whole expression together and allowing for there to be possible whitespace between the numbers and the directions you get

    ^\d+(-\w+)?\s?([NSEW][\s.]){0,2}

    which matches the bold parts of your examples (again, negative examples not tested).
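
    For anyone wanting to check the combined pattern, here is a small sketch in Python (an assumed engine; number_and_direction is a hypothetical helper). The bold formatting of the original examples is not visible here, so no claim is made about the exact intended spans; the script simply prints what is matched.

```python
import re

FULL = re.compile(r"^\d+(-\w+)?\s?([NSEW][\s.]){0,2}")

def number_and_direction(text):
    """Return the house number plus pre-direction matched by FULL, or None."""
    m = FULL.match(text)
    return None if m is None else m.group(0)

for address in [
    "110-A N Oak ST",
    "1001 S.E. Elm DR",
    "13-15 S. Birch LN NE",
    "1515 Set N Sun ST",
]:
    # repr() shows any trailing whitespace included in the match.
    print(f"{address!r:26} -> {number_and_direction(address)!r}")
```

    Note that the match can include a trailing space (for instance after "N" in the first example, or after "1515" in the last), so the result may need trimming before use.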

    Susan
