Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

need advice

Last post 07-18-2012, 6:32 PM by Aussie Susan. 10 replies.
Sort Posts: Previous Next
  •  07-11-2012, 6:31 PM 85734

    need advice

    Here is a problem I have been banging my head at for a couple of days.
    I was wondering if anyone could help.

    I have this class of strings with this pattern:

    s1?=s2? s3 s4?          = always follows s1 only, and only if s1 is not empty

    s1 is  ({?\(?word( word)*\)?\}?)?  (it may be empty). If there is more that one word they MUST have () around them.
    word is made out of a specific set of characters like [%a-k]+ (it is not \w+ but this is irrelevant)
    If { is present so is }. Same for (). If they are both present {} are outside. (I don't it matters here but I might as well say it).
    5 examples: 'abc', '(a)', '{bbc}', '{(a bc def)}', '' (nothing)
    This is not possible: 'a bc' (no paren around 2 words), '(a}' (mismatched ({s ), '({a b c})' ({} and inside the parens).
    If s1 is not empty it WILL be followed by '='.

    s2 is like s1, it can also be empty

    s4 is like s1 except that it cannot have {}s

    s3 is either a single word or 2 or 3 words in parens.
    Ex: '%ab', '(daa KK)', '(abc DEF ghi)'
    If there are no parens ( () or } or =) to separate s3 from its neighbour then 1 space is introduced so words are not merged into a single word (e.g. 'a b c' would become 'abc' and we wouldn't know if it's 1, 2 or 3 words)

    Example of strings belonging to this "language" are.

    'a=b AND(c d e)'  :space added before AND
    'F x'             :s3 s4, space added after F
    '{(a plus b)}=X'  :s1=s4
    'a(ff COMP gg)(alpha %%%)'   :s2 s3 s4
    '{(many items%£)}={oblabioblada}(x ROWBOAT)(a b c d e f)'    :s1=s2 s3 s4
    'XYZ'             :s3 only as s1, s2 and s4 are all empty

    My problem is to find, in s3, either the 1st word if there is only one, or the second one if there are more (2 or 3).
    They are capitalized in the examples. Those are (let's call them) the MAIN words.
    There is no need to check for parens balancing, we know that they do. 

    I am using PCRE.
    I have found patterns to retrieve MAIN when s2 is empty and s3 is a single word but not the general case.

    This is not critical, I am doing this for fun, as I try to learn more about regexes.
    If anyone is enclined to give me advice I would appreciate it.

    Thanks

    Filed under:
  •  07-11-2012, 11:41 PM 85735 in reply to 85734

    Re: need advice

    First a couple of comments: if this is just for fun and to learn about regexes, then this is probably too complex for the purpose. Also, I think you may need to go back and review the "rules" for a valid sentence and your examples. Your "sentence" is of the form

    s1 = s2 s3 s4

    where s3 is required. However in your 3rd example of "{(a plus b)}=X" you state that the "X" must be "s4" - I'm assuming this is a typo and you mean "s3".

    I've also taken some liberties in the pattern I created by making the "spaces around s3" rule as required at both ends if there is no leading parentheses - you can extend my pattern to cater for this requirement better but, at this stage, all that would do is to complicate the pattern and obscure what I think is the mail point of your question: namely identifying the first or 2nd word in s3. Also I've used '\w' instead of whatever your real requirement is (again just to get something that works) and also modified your test strings slightly as a result.

    The pattern I have used is:

    ^((?>(\{|\()*\w+(?(2)(\s+\w+)*)\)?\}?=)?)
    ((?>(\{|\()*\w+(?(5)(\s+\w+)*)\)?\}?)?)
    ((?>(\{|\()*(?(8)|\s)\w+(?(8)(\s+\w+)(\s+\w+)?)\)?\}?(?(8)|\s)))
    ((?>(\()*\w+(?(12)(\s+\w+)*)\)?)?)
    \r?$

    with the multiline option set. I also have the "ignore cases" and "ignore whitespace" options set but they do not really impact the match process. The test strings are:

    a=b AND (c d e)
     F x
    {(a plus b)}= X
    a(ff COMP gg)(alpha 999)
    {(many items)}={oblabioblada}(x ROWBOAT)(a b c d e f)
     XYZ 

    (you will note that I've added a few spaces so that each line parses correctly)

    The basic pattern for each of the parts is:

    ((?>(\{|\()*\w+(?(5)(\s+\w+)*)\)?\}?)?)

    The outer set of parentheses is there to capture the full text of each part or be null if the part is not involved in the match.

    Within that is the group '(?>......)?' which is an optional group group that, if it matches, is greedy. This stops some problems with the "F x" example that would otherwise try to match as "s2 and s3".

    The next part is '(\{|\()*' and will match any leading parentheses (of either type) but, more importantly, provides us with a capture group that we can test to see the part as leading parentheses or not. (Note that the number of this group will change according to whatever else occurs in the pattern - this can be a great source of errors; as you are using PCRE you may make this a bit easier by using a named group).

    The '\w+' will capture the first word.

    The next bit uses a conditional group that tests if there is an opening parentheses - '(?(5)(\s+\w+)*)'. If there is a parenthesis, then we look to see if there are any number of spaces and following words.

    Then follows the test for the trailing parentheses. If you have not mentioned that you don't need to check the balance of the parentheses, then this part should have been included in the conditional group (but I got lazy!)

    In reality, I created 4 instances of this sub-pattern and added some other bits to check for the start and end of a line and used this as my starting point.

    The "s3" part is:

    ((?>(\{|\()*(?(8)|\s)\w+(?(8)(\s+\w+)(\s+\w+)?)\)?\}?(?(8)|\s)))

    It is basically the same as I have described above and if we strip away the parts we have already discussed at the start and end we get:

    (?(8)|\s)\w+(?(8)(\s+\w+)(\s+\w+)?)\)?\}?(?(8)|\s)

    The first part here is a check that, if there are NO leading  parentheses, then there must be a space. You will note that I'm using the conditional with a null "yes" part and only using the "no" part to force the whitespace match.

    While we are here, you will see a very similar pattern at the end where I have changed your rules a bit so that if there are no parentheses then we also need a trailing space. You can expand on this if you like to make this better fit your rules (i.e. whitespace required but only if there is a word character following).

    The '\w' is familiar and will capture the first (and perhaps only) word.

    Similarly, the '(?(8)(\s+\w+)....)?' is nearly the same except that, in this case the 2nd word is NOT followed by a quantifier and relies on the whole conditional being optional to allow for there to be a single work in parentheses. The capture group here will hold the 2nd word (if any) within the "s3" part.

    The '(\s+\w+)?' is the same and will match any 3rd word. (If you want to match more than 3then change the quantifier at the end).

    Now, when you get a match from the whole pattern, "s1" will be in match group #1, "s2" will be in match group #4, "s3" will be in match group #7 and "s4" in match group #11. Again focusing in on the "s3" part, if there is only a single word, then match group #9 will be blank and you can use whatever is in Match group #7 as the target word. If there are 2 or more words, then the 2nd word will be in match group #9 and you can use that directly.

    Does this all make sense???

    Susan

  •  07-12-2012, 6:52 PM 85748 in reply to 85735

    Re: need advice

    First of all thank you for taking the time to reply.

    You are right about the typo, I meant s3.
    And it does make sense.

    Your pattern looks a bit like the one I first came up with (mine didn't have conditions, assuming the input was always 

    "in the language") but, unlike mine, yours works a LOT better :) 
    There are cases where it fails but this is because of the modification you introduced about spaces around s3.

    There are 196 different cases here. I have tested them all. Only 56 fail because of the "no space" rule.

    I tried to modify your expression to use names and to accommodate the absence of space around the single MAIN word but 

    failed.

    Here is my new expression:

    ====================

    (?x)
    ^((?> (?<p1> \{|\( )* \w+ (?(p1)(\s+\w+)* \)?\}?)=)? )       # s1 (p1 not used)
     ((?> (?<p2> \{|\( )* \w+ (?(p2)(\s+\w+)* \)?\}?) )? )       # s2 (p2 used below)
     
     # s3
     ((?> (?<p3> \( )? (?(p3)|  # if it doesn't start with a ( then 
           (?(p2)|\s+) )        # if there was not a paren in s2 look for a space 
     # main section:
         (?(p3) \w+\s(?<main2>\w+)(?>\s\w+)? | (?<main1>\w+) )  
         \)? # closing ) will be there if opening one was there
     # next: we should be looking for a space or ( if we didn't have ()s
         (?(p3)|(?![^(\s]))     # no parens? cannot be followed by anything than ( or space        
     ))                         # (?![^ used instead of (?=[ to cover case where nothing follows

     ((?> \s?(?<p4>\()? \w+ (?(p4)(\s+\w+)* \)? )? ) )           # s4

     \r?$

    ===================

    Notes:

    a) s1, s2 are identical to yours, s4 is slightly different: I accept a possible space and expect no more than 1 '('

    b) In s2 I set p2:
    ((?> (?<p2> \{|\( )* \w+ (?(p2)(\s+\w+)* \)?\}?) )? )  

    and then use the presence of p2 to force s3 to see a space by doing
    ((?> (?<p3> \( )? (?(p3)|(?(p2)|\s+) )...

    but that seems not to work.

    b) I tried to keep the same name for MAIN1 and MAIN2 but this seems unfeasible.


    Why are you using \r? towards the end?

    /Dan


  •  07-13-2012, 12:54 AM 85752 in reply to 85748

    Re: need advice

    Let me take a step back: lets get the basic structure sorted first and then worry about capturing the correct bits of it.

    To do this, I want to stop possible confusion about what the <p1>, <p2> etc in your pattern is actually telling us. Therefore lets start with a slightly modified "sentence" matcher:

    (?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?

    This is basically what we had before except that it is really composed of 2 alternatives: the first (which is actually written last) is the simply '\w+' with no surrounding punctuation, and the other alternative is where we have the surrounding brackets that must be at the start and end and there must be at least one word in between and possibly more separated by a space.

    You will note that I've switched to using '\x20' instead of '\s' as, in some places, this was matching the end of line incorrectly - the '\x20' will only match a space instead of any whitespace character.

    Now the named group will match as a whole or will not - and we can test the name to make sure the sentence is given (or not). Previously, the name was telling us whether or not the sentence had punctuation - if there was none, then we could not use the name to differentiate between when the sentence was not there or just the punctuation was missing. This will become important when we get to handling the required "s3" case.

    Therefore, the simplistic  application of this sub-pattern to the whole language becomes:

    ^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
    (?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
    (?<s3>[{(]+\w+(\x20\w+)*[})]+|\w+)
    (?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
    \r?$

    In other words, sentence 1 must have a trailing "=" character, sentence 2 os optional, sentence 3 is required and sentence 4 in optional.

    My the way, in answer to your last question, the '\r?$' simply forces a match with the end of the string. The reason the '\r?' is there, is that  some platforms represent the end of the string as '\n' and others as '\r\n'. Also, some systems have nothing at the end of the text -  the '$' anchor is happy with this situation. This little pattern will match all of these situations.

    This actually matches all but the first 2 of your examples correctly, allocating the right text to each of the "sentences".

    The reason it does not match the first 2 cases is that there is no allowance for the spaces around sentence 3. If we go back to the basic sentence structure, we only have to worry about spaces when there is no punctuation. Therefore lets split the sentence up as follows:

    (?<s3>
          [{(]+\w+(\x20\w+)*[})]+
          |
          \w+
    )

    and we start work on the conditions where spaces need to be handled. The first is, if Sentence 2 is present (don't care about sentence1 as it has trailing punctuation anyway) then we need a space:

    (?<s3>
          [{(]+\w+(\x20\w+)*[})]+
          |
            (?(s2)\x20)
            \w+
    )

    At this point we actually match ALL of your examples, BUT, example 2 has the "F" matching sentence 2 and the "x" matching sentence 3. Therefore we need to do something to stop this situation from occurring.

    If sentence 2 is there, then we have just matched sentence 3 AND either there is a sentence four (which will need another space) or we should be at the end of the line.

    Therefore we ad in another part after the word match:

    (?<s3>
          [{(]+\w+(\x20\w+)*[})]+
          |
            (?(s2)\x20)
            \w+
            (?(s2)(?!\r?$)|(\x20|\r?$))
    )

    What this part says is, if sentence 2 has been given and we have just matched sentence then we can't be at the end of the line: otherwise we should have matched this as sentences 3 and 4, not 2 and 3. On the other hand, if there is NO sentence 2, then we must either have a space before any sentence 4 or we are at the end of the line.

    Putting this altogether, we get:

    ^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
    (?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
    (?<s3>
          [{(]+\w+(\x20\w+)*[})]+
          |
            (?(s2)\x20)
            \w+
            (?(s2)(?!\r?$)|(\x20|\r?$))
            )
    (?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
    \r?$

    Now, let's dig further in to sentence 3 and capture the main word that you are after. To do this we can create a simply named group around our loan '\w+' as in '(?<Main>\w+)'.

    However the way we have written the first alternative (with the punctuation) is not quite right in this case. Elsewhere, we don't care about any specific word, but in THIS case you want to capture the 2nd word only. Therefore we need to change this part to '[{(]+\w+(?<Main>\x20\w+)(\x20\w+)*[})]+'.

    You will see that I have used the same name - I tested this in a regex test platform that uses the .NET variant but the same is true for PCRE in my experience. When you have named groups, the regex engine is happy to find the same name used in multiple places and will simply add the captured text into the named "slot' Therefore the named capture group "Main" will receive either the first and only word, OR the second (of 2 or more) word.

    Therefore the final pattern I've used is:

    ^(?<s1>(?>([{(]+\w+(\x20\w+)*[})]+|\w+)=))?
    (?<s2>(?>[{(]+\w+(\x20\w+)*[})]+|\w+))?
    (?<s3>
          [{(]+\w+(?<Main>\x20\w+)(\x20\w+)*[})]+
          |
            (?(s2)\x20)
            (?<Main>\w+)
            (?(s2)(?!\r?$)|(\x20|\r?$))
    )
    (?<s4>[{(]+\w+(\x20\w+)*[})]+|\w+)?
    \r?$

    and the captures are:

    s1 = whatever matched sentence 1, ditto s2, s3 and s4; "Main" will contain the target sentence 3 word.

    One thing you may want to do is to trim any extra whitespace that may be included in the captured text of sentence 3.

    Susan

  •  07-13-2012, 7:14 AM 85753 in reply to 85752

    Re: need advice

    Wow! I'll have to sit down and go over that slowly over the weekend :)

    But quickly I tested your last pattern against all 196 cases and

    1. the same name cannot be repeated with my engine (I had tried that before) so I renamed them

    2. it still doesn't match a single word by itself (in fact none of the cases) but I'll come back later with more details probably Monday

     Thanks again.

    /Dan 

  •  07-15-2012, 11:29 PM 85771 in reply to 85753

    Re: need advice

    OK, I had a closer look.
    My last message said you were missing all cases, that's because, silly me, I had forgotten to put '(?x)' in front of your pattern.
    In fact it only missed 35 cases. For ex 'FFF(A)' should have been deemed OK.
    But again that's my fault, I should have explained that the space around s3 is only necessary to separate it from s2 or s4.
    If s2 or s4 already use brackets then there is no need to have a space on that side.

    So I modified your pattern to do just that and it works and covers all cases. Thank you. Very instructive.

    I also made a couple more change. In case you care here it is.

    Here is s1; since I don't care about what is matched I wrote:

    ^.*?=? 

    Another thing, parens are needed if there is more than one word but a single word can have parens too.
    In s2 a single word can be surrounded by either/and {} and ()s. Details.

    I made all \w++ atomic, this covers the case ABC which could be seen as A B C the way s3 was written. I also removed the 

    OR:

     (?<s2>(?<b2>\{)? \(? \w++ (?(b2)(\x20+\w++) \)? *\}) )?


    In S3 I moved the space outside and added a condition that says that with parens it cannot be the end.

    The rest is the same. Final result:
    ===========

    (?x)
    ^.*?=?
    (?<s2>(?<b2>\{)? \(? \w++ (?(b2)(\x20+\w++)* \)? \}) )?
    (?<s3>
          \( \w++ \x20 (?<Main>\w++) (\x20\w++)* \) (?!\x20*\r?$)
          |
            (?(s2)\x20*)
        \x20* (?<Main2> \w++) \x20*
            (?(s2)(?!\r?$)|(\x20*|\r?$))
    )
    (?<s4>\(\w+(\x20\w++)*\)|\w+)?
    \r?$

    ===========
     
    Again, thank you.
    /Dan 
  •  07-16-2012, 9:01 PM 85784 in reply to 85771

    Re: need advice

    Good to hear that it is working.

    A word of warning about the '.*?=?' construct. The initial situation is that this will match nothing and then go on to try to match the test of the pattern. Only if that fails (probably because of the "=" not matching anything else) will the regex engine backtrack to this part and try to match one character before again attempting to match the rest of the pattern. Again, only if that fails will it backtrack to the start, match the 2nd character and go through the whole process again,  and again and again until it can match the '=".

    It might be better to try something like

    ^(((?!=).)*=)?

    which will initially scan forward looking for the "=" - if there is none then it will scan the whole text, backtrack once and then move on. If there IS a "s1" then it will be found and the rest of the pattern scanned.

    This may or may not be an issue for you - it all depends on your circumstances. In general I try to stay clear of backtracking unless it is needed as it has the potential to really mess with your performance. If you are doing this once or twice then it doesn't matter, but if you are parsing many 1000's of sentences each second, then you want to keep the processing as "lean and mean" as you can.

    In a completely "rough and ready" and totally unscientific test, I compared the 2 methods: for 10 (identical) matches did involve an "s1" sentence, my test platform recorded took 0.002 sec for my method and 0.003 sec for yours. I repeated the test with a line of text that didn't have an "s1" part and got basically the same results. In this case it is a tiny difference but it might multiply up if you are not careful.

    Susan

  •  07-17-2012, 1:23 PM 85794 in reply to 85784

    Re: need advice

    I understand.

    What about "^([^=]*+=)?", wouldn't that also be faster?

  •  07-17-2012, 10:36 PM 85796 in reply to 85794

    Re: need advice

    I think you mean '^([^=]*=)?' (the '*+' would be an illegal quantifier).

    As to the speed aspect, it probably makes little difference: both will need to process either to the end of the line and then backtrack to the start, or will find a match with the "=" and move on from there.

    The only thing you might like to try is to make the group "greedy" so that if the "=" is NOT found then the backtracking will all be in 1 step rather than character by character.

    As always, when you get to the questions about optimisation, the real answer is "does it make a significant difference in the circumstances?". If not, then don't bother as you can spend a lot of time trying to shave off a few mSec - if that doesn't make a real difference to what you are trying to do, then why spend the time.

    Susan

  •  07-17-2012, 11:22 PM 85798 in reply to 85796

    Re: need advice

    I am not concerned about speed here. It is more at an informative level.

     "^([^=]*+=)?" seems to work on my PCRE engine and + means 'atomic' here.

    I want a series of non "=" followed by "=". I that fails I don't want to backtrack. 

    And since the default is greedy I don't need to worry about that (or did I misunderstand you?) 

  •  07-18-2012, 6:32 PM 85811 in reply to 85798

    Re: need advice

    Thank you for the information about the PCRE "possessive" quantifier - I was unaware of that syntax variation but as I can't see it in Perl or the .NET regex pattern specification, I'm assuming that it is a PCRE-specific syntax extension. Anyway, it is doing what I suggested anyway.

    As for my use of the term "greedy", I probably should have said 'atomic" but, unfortunately the term "greedy" means something slightly different when applied to a quantifier and to a group. As  quantifier, it means "match as many times as possible" and is the opposite of "lazy". However in a group context, it means the same as "atomic" in that it prevents the backtracking within the group. I guess I just used the term that had meaning for me and didn't think anything more about it - sorry.

    Susan

View as RSS news feed in XML