
Need help parsing search parameters (w/ quotes and apostrophes)

Last post 11-18-2010, 6:35 AM by blurmy23. 6 replies.
  •  11-16-2010, 11:51 PM 73922

    Need help parsing search parameters (w/ quotes and apostrophes)

    I'm looking for a regex that can parse the following search query... it'll ultimately be in PHP.

    label:foo label: bar baz label:'eat at joe's'

    into

    label:foo
    label: bar
    baz
    label:'eat at joe's'

    or

    label
    foo
    label
    bar
    baz
    label
    eat at joe's

    The following two are close but don't quite cut it thanks to the apostrophe in joe's:

    ('.*?'|".*?"|\S+)
    [^\s"']+|"([^"]*)"|'([^']*)'
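    To make the failure concrete, here is a minimal sketch that runs both attempts against the sample query. It uses Python's `re` module rather than PHP purely for illustration (an assumption; the variable names are hypothetical), and the patterns themselves are unchanged.

```python
import re

query = "label:foo label: bar baz label:'eat at joe's'"

# The two attempted patterns, copied verbatim from the post.
first = re.compile(r"""('.*?'|".*?"|\S+)""")
second = re.compile(r"""[^\s"']+|"([^"]*)"|'([^']*)'""")

# Collect the full text of every match, in order.
first_tokens = [m.group(0) for m in first.finditer(query)]
second_tokens = [m.group(0) for m in second.finditer(query)]

print(first_tokens)
print(second_tokens)
```

    The first pattern never sees the opening quote, because `\S+` swallows it as part of `label:'eat`; the second treats the apostrophe in joe's as the closing quote and leaves a stray `s` behind.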

    Any help would be appreciated.

    Thank you!

  •  11-17-2010, 12:15 AM 73923 in reply to 73922

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    Ah, I should also mention that this will need to support both standard spaces and Japanese spaces (the full-width ideographic space, U+3000) because the search will be in Japanese.
  •  11-17-2010, 12:25 AM 73924 in reply to 73922

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    You will always have problems with unbalanced apostrophes as in your example unless you can define some "rules" as to how they should be handled. Normally these things can be "detected" by counting whether an even or odd number of quotes (single or double) follow some condition.

    I've assumed that what you are really trying to do is split this on both whitespace and colons, but not when they occur within a quoted string (which I've defined in this case as an opening quote and everything up to a closing quote that is followed by whitespace or the end of the line). Therefore try:

    '((?!'(?=\s|$)).)*'|[^:\s]+

    with the 'singleline' option set as necessary.
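    A minimal sketch of how this pattern behaves on the sample query, written in Python instead of PHP as an illustration only. The name `tokenize` is hypothetical, the capturing group has been made non-capturing so that `findall` returns whole matches, and `re.DOTALL` stands in for the 'singleline' option.

```python
import re

# Quoted string first (closing quote must be followed by whitespace or the
# end of the input), otherwise a run of characters that are neither colons
# nor whitespace.
TOKEN = re.compile(r"'(?:(?!'(?=\s|$)).)*'|[^:\s]+", re.DOTALL)

def tokenize(query):
    """Return the tokens of a search query, keeping quoted strings whole."""
    return TOKEN.findall(query)

print(tokenize("label:foo label: bar baz label:'eat at joe's'"))
# ['label', 'foo', 'label', 'bar', 'baz', 'label', "'eat at joe's'"]
```

    In Python, `\s` on a `str` pattern also matches the ideographic space U+3000, so `tokenize("label:foo\u3000bar")` splits there as well; whether the same holds in PHP depends on the PCRE build and options.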

    Susan

    PS: I didn't see your 2nd post while I was typing my response. The \s should cover all whitespace, but that will depend on the way PCRE was compiled for your version of PHP (i.e. whether Unicode support was added). If not, then you should be able to see how to extend my suggestion.

  •  11-17-2010, 12:38 AM 73925 in reply to 73924

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    Thanks for your quick response!  Is it easy to add double-quotes into the mix?
  •  11-17-2010, 1:36 AM 73926 in reply to 73925

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    > You will always have problems with unbalanced apostrophes as in your example unless you can define some "rules" as to how they should be handled. Normally these things can be "detected" by counting whether an even or odd number of quotes (single or double) follow some condition.

    Would it help if I only want to treat apostrophes sandwiched by alphanumeric characters as apostrophes and not single quotes?

  •  11-17-2010, 5:34 PM 73954 in reply to 73926

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    Firstly, I need to point out that, while you use different words such as "apostrophe" and "single quote" to differentiate the *usage* of a character, such distinctions are meaningless to a regex. A regex will work at the level of individual characters and so, if an "apostrophe" and "single quote" are represented by the same character (i.e. having the same character coding) then the regex will see them as identical.

    Now, let me show you why regex patterns get very messy very quickly.

    Let's start with wanting to locate all spaces that are NOT between balanced single quotes:

    \s+(?=[^']*('[^']*'[^']*)*$)

    What this does is to locate a space and then use a lookahead to make sure that, if there are any single quotes following the space to the end of the line, that there is a balancing "end quote" for each one. (Note the assumption here that you are using the 'multiline' option and that there are no line breaks between the quotes).
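    As a rough illustration, the split can be tried in Python (not PHP; an assumption made only for the sketch). The inner group is written as non-capturing because Python's `re.split` would otherwise return the captured text as extra list items, and `$` is left to mean the end of the whole string.

```python
import re

# Whitespace followed, up to the end of the input, by an even number of
# single quotes (including none).
OUTSIDE_QUOTES = re.compile(r"\s+(?=[^']*(?:'[^']*'[^']*)*$)")

balanced = "label:foo bar 'eat at joes'"
unbalanced = "label:foo label: bar baz label:'eat at joe's'"

print(OUTSIDE_QUOTES.split(balanced))
# ['label:foo', 'bar', "'eat at joes'"]
print(OUTSIDE_QUOTES.split(unbalanced))
# ["label:foo label: bar baz label:'eat", 'at', "joe's'"]
```

    With balanced quotes the result is as intended. With the stray apostrophe the quote count is odd, so the spaces before the quoted text are followed by three quotes and are skipped, while the spaces inside it are followed by two and get split.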

    Now, let's bring in your distinction between an apostrophe (with an alphanumeric on both sides) and a single quote (which does NOT have an alphanumeric on BOTH sides). Using this, an apostrophe can be specified with the pattern

    (?<=\w)'(?=\w)

    The problem is that we want the opposite of this. Applying normal boolean logic we can get

    ((?<!\w)'|'(?!\w))

    which will match a ' character that lacks an alphanumeric on at least one side. (The outer set of parentheses is included here to limit the effect of the alternation: we will be inserting this sub-pattern into the first one above in several places, so we need to contain the low precedence of the '|'.)
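    A small sketch of the two classifications, again in Python for illustration only (the constant names are hypothetical). It lists the positions of each ' in the quoted part of the sample query.

```python
import re

# A ' with a word character on both sides.
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")
# A ' that is missing a word character on at least one side.
QUOTE = re.compile(r"(?<!\w)'|'(?!\w)")

text = "label:'eat at joe's'"

apostrophes = [m.start() for m in APOSTROPHE.finditer(text)]
quotes = [m.start() for m in QUOTE.finditer(text)]

print(apostrophes)  # [17]
print(quotes)       # [6, 19]
```

    Every ' in the text falls into exactly one of the two lists, which is what makes the second pattern usable as the logical opposite of the first.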

    We can use this to match a single quote but not an "apostrophe". However, we have used the ' character within a character class in the first pattern, and we cannot substitute this new pattern for the ' inside one of those. Therefore we need to realise that the character class matches every character except the ', and so we can use an alternate form of '[^']*' which is:

    ((?!').)*

    assuming that the 'singleline' option is set so that we can get exactly the same matches as the character set original.
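    The equivalence, and why the 'singleline' option matters, can be checked with a short Python sketch (Python's `re.DOTALL` is assumed here to play the role of 'singleline'; the variable names are hypothetical).

```python
import re

text = "abc\ndef'ghi"

# Character class: consumes everything up to the first ', newlines included.
char_class = re.match(r"[^']*", text).group(0)

# Lookahead form with DOTALL: '.' also matches the newline, so the result
# is the same as the character class.
tempered = re.match(r"(?:(?!').)*", text, re.DOTALL).group(0)

# Lookahead form without DOTALL: '.' stops at the newline.
without_flag = re.match(r"(?:(?!').)*", text).group(0)

print(repr(char_class), repr(tempered), repr(without_flag))
```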

    Now we can substitute '((?<!\w)'|'(?!\w))' for ' and '((?!').)*' for '[^']*' in the original pattern and we get:

    \s+(?=((?!(?<!\w)'|'(?!\w)).)*(((?<!\w)'|'(?!\w))((?!(?<!\w)'|'(?!\w)).)*((?<!\w)'|'(?!\w))((?!(?<!\w)'|'(?!\w)).)*)*$)

    We can now use this in a regex "Split" operation, as it will find any (sequence of one or more) whitespace characters that are followed by an even number of "single quotes" (including none), bearing in mind the distinction between a "single quote" and an "apostrophe".
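    Here is a minimal sketch of that split in Python (chosen for illustration; the thread's target is PHP). The long pattern is assembled from two named pieces, `Q` and `NOT_Q`, which are hypothetical names, and all groups are non-capturing so that `re.split` returns only the pieces between matches.

```python
import re

# A "single quote": a ' lacking a word character on at least one side.
Q = r"(?:(?<!\w)'|'(?!\w))"
# Any run of characters in which no "single quote" starts.
NOT_Q = rf"(?:(?!{Q}).)*"

# Whitespace followed by an even number of "single quotes" up to the end.
SPLITTER = re.compile(rf"\s+(?={NOT_Q}(?:{Q}{NOT_Q}{Q}{NOT_Q})*$)", re.DOTALL)

def split_query(query):
    """Split on whitespace that is not inside balanced single quotes."""
    return SPLITTER.split(query)

print(split_query("label:foo label: bar baz label:'eat at joe's'"))
# ['label:foo', 'label:', 'bar', 'baz', "label:'eat at joe's'"]
```

    Building the pattern from parts is only a readability choice; the assembled string is the same expression with non-capturing groups.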

    It is also possible to have this handle double quotes as well as single quotes in 2 ways. The first is to substitute '['"]' wherever a ' character appears in the pattern, but this also means that double and single quotes can be used interchangeably and, depending on your text, may lead to strange results: for example, the double quote in

    Joe"s

    would be treated as an apostrophe. The second is to go back to the original pattern and make it:

    \s+(?=[^'"]*((['"])((?!\2).)*\2[^'"]*)*$)

    which will handle balanced single or double quotes correctly (and, incidentally does NOT have a problem with "eat at Joe's" !) and then go through the process I've outlined above to apply the different meaning to single quotes and apostrophes - I'll leave this as an exercise for the reader!
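    A sketch of this second variant in Python, as an illustration rather than a PHP implementation. The backreference needs a capturing group, and Python's `re.split` would return that group's text in its result, so the hypothetical helper `split_outside_quotes` slices the input at the match positions instead. With the outer groups made non-capturing, the quote character becomes group 1 rather than group 2.

```python
import re

# Whitespace followed only by balanced quoted strings, where each string
# closes with the same quote character that opened it.
OUTSIDE = re.compile(
    r"""\s+(?=[^'"]*(?:(['"])(?:(?!\1).)*\1[^'"]*)*$)""", re.DOTALL
)

def split_outside_quotes(text):
    """Split text at every whitespace run that lies outside quotes."""
    parts, start = [], 0
    for match in OUTSIDE.finditer(text):
        parts.append(text[start:match.start()])
        start = match.end()
    parts.append(text[start:])
    return parts

print(split_outside_quotes('label:foo label:"eat at Joe\'s" baz'))
# ['label:foo', 'label:"eat at Joe\'s"', 'baz']
```

    The apostrophe causes no trouble here because it sits inside a double-quoted string, where only another double quote can close it.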

    I'll also let you decide if this answers your question "Would it help if..... "

    Susan

  •  11-18-2010, 6:35 AM 73972 in reply to 73954

    Re: Need help parsing search parameters (w/ quotes and apostrophes)

    Awesome... thanks for being so comprehensive!