
Difficulties Parsing part of a text

Last post 12-16-2012, 7:07 PM by Aussie Susan. 1 replies.
  •  12-14-2012, 1:52 PM 87173

    Difficulties Parsing part of a text

    I have this string in bold 1 - Q254061 ( Prova: FCC - 2012 - TCE-AM - Analista de Controle Externo - Tecnologia da Informação / Modelagem de Processos de Negócio (BPM) /
    BPMN (Bussines Process Modeling Notation) ;  ) and I am trying to get only the part that corresponds to Prova: FCC - 2012 - TCE-AM - Analista de Controle Externo - Tecnologia da Informação
     
    I am trying to figure out the regex to do that and so far I could not find it.
    I tried with this kind of thought to build the regex (?!^((\d)*(\s[-]\s)(\s\(\s)))(([^/])*(?![/]\w).*)
     
    First anchor the begin of the line with ^
    Then build the expression that corresponds to 1 - Q254061 ( that is, or at least I think it is, \d\s[-]\sQ(\d){6,7}\s\(
    Then negate everything until then using an exclusion group (?:^(\d\s[-]\sQ(\d){6,7}\s\())
    Then build the expression for the next exclusion group that comes after the first / character - (?:^[/](.)*)
    Then what I want to get from the expression - (.)*
    And then put it together -  (?:^(\d\s[-]\sQ(\d){6,7}\s\())(.)*(?:^[/](.)*)
     
    But I get no match.
    Where is my mistake? 
  •  12-16-2012, 7:07 PM 87178 in reply to 87173

    Re: Difficulties Parsing part of a text

    Your 2nd regex (near the bottom of your post) is almost there but you seem to have misunderstood some of the basic regex pattern operators and the way quantifiers work.

    Firstly, the quantifiers: in several places you have attached the quantifier to a match group where I suspect that you don't want this. When a match group is made to repeat, for almost all regex engines, any previous text will be over-written by the subsequent repetition. This means that if you use (say) '(.)*' to match "ABC" then all you will get back from the match group will be the "C".

    What happens is this: the '(' will open a match group, the '.' will start by matching the "A" and the ')' will tell the regex engine to close the match group, saving the matched character for later. Next it sees the '*' quantifier and so the regex engine will go back to the start of the previous item (in this case the whole match group) and try to match the next character. Therefore it will again see the '(' in the pattern and open the SAME match group as before, have the '.' match the "B" and then see the ')' in the pattern and so close the match group, writing the "B" over the top of the "A" that was there before.

    This repeats a 3rd time and so the "C" will overwrite the "B".

    I suspect that when you are using such a pattern you want to save ALL of the characters that are matched by the '.' (in this case). Therefore you need to apply the quantifier to the actual matching operator and not the group, as in  '(.*)'.

    The same goes for '(\d){6,7}' which probably should be either '\d{6,7}' if you simply want to match the digits, or '(\d{6,7})' if you want to capture them as well.
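The overwriting behaviour described above can be checked directly. The following is a minimal sketch in Python (an assumption, since the original question does not say which regex engine is in use); the variable names are illustrative only.

```python
import re

# Quantifier attached to the group: the same group is re-entered on every
# repetition, so only the last character matched survives in group 1.
repeated_group = re.match(r"(.)*", "ABC")
print(repeated_group.group(0))  # whole match: ABC
print(repeated_group.group(1))  # last repetition only: C

# Quantifier inside the group: the group is entered once and keeps everything.
whole_group = re.match(r"(.*)", "ABC")
print(whole_group.group(1))  # ABC

# The same difference applies to the digit group from the question.
digits_repeated = re.search(r"Q(\d){6,7}", "Q254061")
digits_whole = re.search(r"Q(\d{6,7})", "Q254061")
print(digits_repeated.group(1))  # 1
print(digits_whole.group(1))     # 254061
```

In both cases the overall match (group 0) is identical; only the content of the capture group changes.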

    OK, let's build up your required pattern.

    You want to match the "n - Qnnnnnn (" part: you pretty much have this right, but I would suggest

    \d\s-\sQ\d{6,7}\s\(

    as you want to locate this pattern as the "start" marker but you don't seem to want to actually capture the characters.

    By the way, you have the wrong idea of an "exclusion" group. In a regex pattern you MUST account for every character between the first and last that you want to match in the entire pattern. You can't "exclude" characters. Also, what you need to know is that there are operators that change their meaning depending on where (and how) they are used. In this case, outside of a character set definition (which is how it is used here), '^' means "match the start of the text (or of a line if the "multiline" option is set)".

    What I believe you are thinking of is the use of the '^' character as a character set negation operator (when used as the first character after the '[' that opens the set definition). For example, '[aeiou]' is a character set definition that will match a single vowel character, and '[^aeiou]' is a character set definition that will match any single character that is NOT a vowel. (I'll assume the "ignore case" option is set to simplify matters.) (To make matters worse, '[aeiou^]' is a character set definition that matches any vowel OR the "^" character - as the '^' is not immediately after the '[', it is interpreted as a character and not the set negation operator.)
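The three meanings of '^' can be compared side by side. This is a small sketch using Python's `re` module (an assumed engine; the sample strings are made up for illustration).

```python
import re

# Inside a set, first position: negation. Elsewhere in a set: a literal '^'.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")
vowel_or_caret = re.findall(r"[aeiou^]", "a^b")
print(vowels)          # ['e', 'e']
print(non_vowels)      # ['r', 'g', 'x']
print(vowel_or_caret)  # ['a', '^']

# Outside a set: an anchor for the start of the text, or of each line
# when the multiline option is set.
start_only = re.search(r"^b", "a\nb")
start_of_line = re.search(r"^b", "a\nb", re.MULTILINE)
print(start_only)             # None
print(start_of_line.group())  # b
```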

    Therefore, if you want to simply "skip over" the "n - Qnnnnnn (" part, the pattern we have above is all that you need: it will match but no specific text will be captured (except as part of the overall match - you can't ever get around that).

    Now we get to the part that you DO want to capture: everything after the "n - Qnnnnnn (" and before the next "/". There are a couple of ways to do this, but following your lead, we can use '.*'. On its own, this will match every character (including the newline character if the "singleline" option is set) through to the end of the text. That is because the '*' quantifier is greedy - in other words it will match as many characters as possible. Therefore a simple way of limiting how many characters are matched is to make the '*' quantifier lazy by specifying it as '*?'. (Note that this does not work on all regex variants; you have not told us which one you are using so this may or may not be possible in your case).

    The way to make this work is to add after the '.*?' the characters that you want to match "next" - the "/" in this case. Therefore the pattern for this part will be

    (.*?)/

    Note that I have created a match group around most of this pattern as we DO want to capture all of the text that this pattern matches.
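To see the difference between the greedy and lazy forms, here is a short sketch in Python (an assumed engine that does support lazy quantifiers; the sample text is invented for illustration).

```python
import re

sample = "one / two / three"

# Lazy: stops at the first "/" that lets the rest of the pattern match.
lazy = re.search(r"(.*?)/", sample)
print(repr(lazy.group(1)))    # 'one '

# Greedy: runs to the end of the text, then backs up to the LAST "/".
greedy = re.search(r"(.*)/", sample)
print(repr(greedy.group(1)))  # 'one / two '
```

Note that the captured text keeps the space before the "/", since '.' matches it like any other character.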

    Alternatively, if your regex does NOT allow for lazy quantifiers, then you can use a pattern such as:

    ((?!/).)*

    The way this works is to use the "negative lookahead" operator - '(?!/)' - to see whether or not the next character is the "/" one we are after. If it is NOT, then that part of the pattern will succeed (this is a "negative" lookahead) and therefore the '.' will match a single character. This "test and match" process is repeated until the next character IS the "/" (or the end of the text is reached), in which case the whole sub-pattern will complete.
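The "test and match" loop can be tried out as follows. This is a sketch in Python (an assumed engine); an extra outer group is added around the repeated one so that all of the matched characters are kept, for the reason given earlier about quantified groups.

```python
import re

# Outer group 1 keeps the whole run; inner group 2 is the repeated
# "not a slash, then any character" step and only holds its last repetition.
up_to_slash = re.compile(r"(((?!/).)*)")

with_slash = up_to_slash.match("one / two")
print(repr(with_slash.group(1)))  # 'one '
print(repr(with_slash.group(2)))  # ' '

# With no "/" in the text, the loop simply runs to the end of the text.
without_slash = up_to_slash.match("no slash")
print(repr(without_slash.group(1)))  # 'no slash'
```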

    In fact, we have done all we need to do to match the part of the text you want. Therefore we can stop the pattern here.

    So the whole pattern is:

    \d\s-\sQ\d{6,7}\s\((((?!/).)*)

    or

    \d\s-\sQ\d{6,7}\s\((.*?)/ 

    (probably with the "singleline" option set) and match group #1 will contain the text you are after.
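Both complete patterns can be checked against the string from the question. The sketch below uses Python, which is an assumption (`re.DOTALL` is Python's name for the "singleline" option); the `extract` helper and the constant names are hypothetical, and the call to `strip()` is an addition to remove the spaces that surround the captured text.

```python
import re

TEXT = (
    "1 - Q254061 ( Prova: FCC - 2012 - TCE-AM - Analista de Controle Externo - "
    "Tecnologia da Informação / Modelagem de Processos de Negócio (BPM) /\n"
    "BPMN (Bussines Process Modeling Notation) ;  )"
)

# The two patterns from the post, unchanged.
LOOKAHEAD = re.compile(r"\d\s-\sQ\d{6,7}\s\((((?!/).)*)", re.DOTALL)
LAZY = re.compile(r"\d\s-\sQ\d{6,7}\s\((.*?)/", re.DOTALL)


def extract(pattern, text):
    """Return group 1 with surrounding whitespace removed, or None if no match."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


print(extract(LOOKAHEAD, TEXT))
print(extract(LAZY, TEXT))
# Both print:
# Prova: FCC - 2012 - TCE-AM - Analista de Controle Externo - Tecnologia da Informação
```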

    Susan 
