regexp is being too greedy

Last post 11-06-2009, 12:01 AM by steveybaby. 4 replies.
  •  11-04-2009, 9:13 PM 57154

    regexp is being too greedy

    I have the following multiline HTML:

     <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
    <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"

    In this example there are two instances I am searching across - but there can be many more, e.g.

     <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
    <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
    <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da79123f2fd" , "13"

    I need to be able to identify the ID of the INPUT tag that precedes a given 0900 number. For example, if I have 09006da79123f2fd I want to get "thisone".

    I've tried the following:

    text =~ /INPUT id\=(.*?) onclick(.*?)09006da79123f2fd/m

    But that returns the FIRST id - the regexp is matching EVERYTHING from the first "INPUT" through to the given 0900 number...it is matching:

    notthisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
    <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
    <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "

     instead of:

     thisone onclick=blah type=checkbox name=PC_71234
    <TD class=really>
    <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "

     

    Does anyone have any idea what I'm doing wrong? I am using the Ruby regexp engine.

  •  11-05-2009, 5:00 PM 57173 in reply to 57154

    Re: regexp is being too greedy

    Your pattern will continue to try to match the text until it matches all of the string at the end of the pattern which, according to what you have given us, is "09006da79123f2fd". There are no instances of this in your first example and 1 in the second. Therefore what you are getting is exactly what you are asking to get according to your pattern.

    If you want to stop at the first instance of "0900" then only use those characters at the end of your pattern. Using:

    INPUT id\=(.*?) onclick(.*?)0900

    with the 'singleline' option, so that '.' also matches newlines (in Ruby that is what the "/m" at the end of your pattern does, so you already have it switched on), I get 2 matches with your first example text and 3 with your second.
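    As a rough illustration, here is a minimal Ruby sketch that runs the shortened pattern against the second sample from the question. The variable names (text, ids) are only illustrative, and the sample is copied from the post rather than from any real page.

```ruby
# Second sample from the question, copied as-is (it is not valid HTML).
text = <<~'HTML'
  <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
  <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
  <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da79123f2fd" , "13"
HTML

# In Ruby, /m lets '.' match newlines, so each match can span several lines.
# scan returns one [group1, group2] pair per match; keep only the id.
ids = text.scan(/INPUT id\=(.*?) onclick(.*?)0900/m).map(&:first)

p ids  # => ["notthisone", "reallynotthisone", "thisone"]
```

    Each match stops at the first "0900" it reaches, which is why there is one match per entry.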

    Susan

  •  11-05-2009, 7:24 PM 57178 in reply to 57173

    Re: regexp is being too greedy

    Hi - thanks for trying to help. I don't think you understand what I'm trying to do. I have the text that matches the 0900xxxxxx string and I want to get the ID that is related to it. E.g. if I have the first one - 09006da78003h2jd, I want to get "notthisone"

    If I have "09006da78003f2fd" I want to return "reallynotthisone"

    And so on. So if I have the nth instance of "0900xxxxxxx" then I want the nth instance of "ID"

    Does that make better sense?

  •  11-05-2009, 11:38 PM 57180 in reply to 57178

    Re: regexp is being too greedy

    OK, now I think I understand. Try:

    INPUT id\=(((?! onclick).)*) onclick(((?!<SPAN>|0900).)*)09006da79123f2fd

    substituting whatever the appropriate string is at the end of the pattern.
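    One way to do that substitution in Ruby is to build the pattern from the number. This is only a sketch: input_id_for is a hypothetical helper name, and the sample text is the second example from the question.

```ruby
# Second sample from the question, copied as-is (it is not valid HTML).
text = <<~'HTML'
  <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
  <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
  <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da79123f2fd" , "13"
HTML

# Hypothetical helper: returns the id of the INPUT whose entry contains
# the given 0900 number, or nil when no entry contains it.
def input_id_for(text, number)
  # Regexp.escape guards against regex metacharacters in the number.
  # /m lets '.' match newlines, since each entry spans several lines.
  pattern = /INPUT id\=(((?! onclick).)*) onclick(((?!<SPAN>|0900).)*)#{Regexp.escape(number)}/m
  match = pattern.match(text)
  match && match[1]
end

p input_id_for(text, "09006da79123f2fd")  # => "thisone"
p input_id_for(text, "09006da78003h2jd")  # => "notthisone"
```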

    Although you were using the '.*?' form in both places within your pattern, what was happening is that the "INPUT id" part was matching the first instance in the text, the '(.*?) onclick' part was matching up to the first "onclick" instance, and the next '(.*?)' was then continuing to match until it could reach the specified character string at the end, even if this meant passing over the "end" of each entry.
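    The behaviour described above can be reproduced with a short Ruby sketch, using the original pattern on the second sample from the question (the variable names are illustrative):

```ruby
# Second sample from the question, copied as-is (it is not valid HTML).
text = <<~'HTML'
  <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
  <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
  <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da79123f2fd" , "13"
HTML

# The original pattern from the question: both groups are lazy.
m = text.match(/INPUT id\=(.*?) onclick(.*?)09006da79123f2fd/m)

# Group 1 stops at the first " onclick", so it holds the FIRST id.
p m[1]  # => "notthisone"

# Group 2 is lazy, but it still has to grow until the wanted number
# appears, so it runs straight across the two later INPUT tags.
p m[2].include?("INPUT id=reallynotthisone")  # => true
p m[2].include?("INPUT id=thisone")           # => true
```

    Lazy quantifiers only prefer the shortest match from a fixed starting point; they do not move the start of the match to a later "INPUT id".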

    Therefore I've used the "<SPAN>" tag as a rough way of identifying when I have gone too far after the "onclick" text. I've used this as it appears to work given the sample text you have provided, but it doesn't seem to be valid HTML (it's missing several ">" characters at the end of the tags at least) and so I'm not sure what your source text really looks like. You may be able to change the first alternate in the negative lookahead to something more appropriate.

    What the last part does is scan forward a character at a time until it reaches either the "end" marker or the start of the number that is wanted. If it is an end marker, then the pattern will fail and the regex engine will backtrack its way to fail the whole match (see below) and so will reject that "INPUT id=" match. If it is the "0900" text then it will check whether it is followed by the required characters, and either do the same backtracking as mentioned before or complete the whole match, which is what is wanted.

    During the backtracking process, the regex engine would come across the first '(.*?)' part of the original pattern and would then use that to try moving forward again, this time matching everything up to the NEXT "onclick" text, which is actually in the next entry. This is also not what is wanted, so I've used the same "trick" to stop when the "onclick" text is found. If we ever backtrack to this part of the pattern, then the negative lookahead here will not allow the regex engine to move forward and will reject the whole entry. Having done that it will do its usual character-by-character advance, looking for another place to start the match, which will be in the next entry along, and the process will start all over again.
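    To see why the first group needs the lookahead as well, here is a small Ruby sketch of a half-fixed pattern: the first group is left as '(.*?)' and only the second group uses the negative lookahead. The half-fixed pattern is my own construction for illustration, not one posted earlier in the thread.

```ruby
# Second sample from the question, copied as-is (it is not valid HTML).
text = <<~'HTML'
  <SPAN><INPUT id=notthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Guinevere" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003h2jd" , "13"
  <SPAN><INPUT id=reallynotthisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da78003f2fd" , "13"
  <SPAN><INPUT id=thisone onclick=blah type=checkbox name=PC_71234
  <TD class=really>
  <DIV class=cell><SPAN=dir><A title="Sir Lancelot" href='BLOCKED SCRIPTPC_7("doclist" , "09006da79123f2fd" , "13"
HTML

# Half-fixed pattern (illustrative only): lazy first group, guarded second.
half_fixed = /INPUT id\=(.*?) onclick(((?!<SPAN>|0900).)*)09006da79123f2fd/m
m = half_fixed.match(text)

# After the second group fails on the first two numbers, the engine
# backtracks into the lazy first group, which grows to the next
# " onclick" each time. The match succeeds, but group 1 is not an id:
puts m[1].lines.first  # notthisone onclick=blah type=checkbox name=PC_71234
puts m[1].lines.last   # <SPAN><INPUT id=thisone

# With both groups guarded, group 1 can never cross an " onclick".
fixed = /INPUT id\=(((?! onclick).)*) onclick(((?!<SPAN>|0900).)*)09006da79123f2fd/m
p fixed.match(text)[1]  # => "thisone"
```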

    Susan

  •  11-06-2009, 12:01 AM 57181 in reply to 57180

    Re: regexp is being too greedy

    WOW! Thank you Susan - that is brilliant. I have read your explanation several times and still can't get my head around how this works exactly, but work it does!

     Many thanks for your help, much appreciated.
