
Find Inner Text Between Key Words (non HTML or XML)

Last post 11-14-2008, 3:53 PM by sindizzy. 24 replies.
  •  11-01-2008, 4:12 PM 47848

    Find Inner Text Between Key Words (non HTML or XML)

    I am attempting to find inner text in between two key words for a text file that is not HTML or XML. The text file would have text like so:

    D001  0950 X R H                                              
     ROUTE POINTS  STA1  -AAA  -SDR  -AAA -AAA11-BBB  -DDD2-
                   DDD3-CCC-DDEE-RRRR-WWWXX-YYYY-ZZ023-ZZZZZ-
                   STA2  -                                         
                                                                  
     DELIVERY TEXT     SAT1.STOP1.AAA.RT44.BBB.RT4.CCC.STOPX.STA2
                                                                  
     ROUTE TEXT    STA1 AAA.RT44-BBB.RT4-CCC STA2                     
                                                                  
     REMARKS 

    D035                                                
     ROUTE POINTS  WW1-WW2-DTT  -AA2  -BB2  -RRR  -QQQQQ-AAQQ-
                   HHH  -MMMM-NNNN-LLLL-KKKKK-KKKWW-QQAA-
                   BPPPPPASBL-STA2  -                                   
                                                                  
     DELIVERY TEXT     STA1.STOP5..AA2.DCT.BB2.RT100.HHH.STOP7.STA2      
                                                                  
     ROUTE TEXT    STA1 AA2-BB2.RT100-HHH STA2                        
                                                                  
     REMARKS  TRAFFIC AVOIDANCE
     

    I am trying to capture everything from D### through the end of the REMARKS line (keywords bolded). I have a pretty good start: this is what I have currently, and I am using VB.NET 2005 to parse the text.

    D\d{3}[A-Z]{0,1}(.|\s)*?REMARKS

    However, I'm struggling with two things.

    1. How can I capture all the way to the end of the REMARKS line? I tried appending (.|\s)*?(?=\r) to the regexp, and that seems to work, but is it the best way to do it?

    2. Sometimes there are route points that look like the route ID (bolded), and that may be slowing the matching. I say this because even on small text files this regexp takes a while (10-15 seconds). I'm thinking I should add some sort of negation or non-matching between my two keywords, but I am unsure how to add it.

    Any suggestions welcome.

    AGP

  •  11-01-2008, 5:03 PM 47850 in reply to 47848

    Re: Find Inner Text Between Key Words (non HTML or XML)

    Try this regex:

    (?s)D\d{3}(?:(?!REMARKS).)*REMARKS
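Editor's sketch of how the pattern above behaves, written in Python's `re` module rather than VB.NET (an assumption made only so the example is runnable here; the syntax used is common to both flavors). The sample text is a shortened, hypothetical stand-in for the listing in the original post.

```python
import re

# Shortened stand-in for the route listing in the original post (hypothetical data).
SAMPLE = (
    "D001  0950 X R H\n"
    " ROUTE POINTS  STA1  -AAA  -SDR\n"
    " REMARKS\n"
    "\n"
    "D035\n"
    " ROUTE POINTS  WW1-WW2-DTT\n"
    " REMARKS  TRAFFIC AVOIDANCE\n"
)

# (?s) lets '.' match newlines. (?:(?!REMARKS).)* consumes one character at a
# time, but only while the keyword REMARKS does not start at that position.
PATTERN = re.compile(r"(?s)D\d{3}(?:(?!REMARKS).)*REMARKS")

blocks = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
for block in blocks:
    print(repr(block))
```

Each match stops at the keyword itself, so any text following REMARKS on the same line (here "TRAFFIC AVOIDANCE") is not part of the match.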

  •  11-01-2008, 5:21 PM 47851 in reply to 47850

    Re: Find Inner Text Between Key Words (non HTML or XML)

    I am not versed in VB.NET (I use PHP), but here is what I did:

    #D\d{3}(.+?REMARKS[^\r\n]*)#s

     

    To explain: the # characters are my chosen delimiters (you probably don't use those in VB.NET). I look for a D followed by 3 digits, then use a lazy quantifier to capture everything up to REMARKS, then I include REMARKS and anything that is NOT a newline or carriage return, zero or more times. The s modifier after the closing # delimiter is the dotall option, which makes the dot match newlines as well. With that in place I can use preg_match_all to capture every instance of what the pattern seeks. My pattern outputs the following:

     

    Array(    [0] =>   
    0950 X R H                                                
    ROUTE POINTS  STA1  -AAA  -SDR  -AAA -AAA11-BBB  -DDD2-               
    DDD3-CCC-DDEE-RRRR-WWWXX-YYYY-ZZ023-ZZZZZ-               
    STA2  -                                                                                                          
    DELIVERY TEXT     SAT1.STOP1.AAA.RT44.BBB.RT4.CCC.STOPX.STA2                                                                
    ROUTE TEXT    STA1 AAA.RT44-BBB.RT4-CCC STA2                                                                                      
    REMARKS      
    [1] =>                                                   
    ROUTE POINTS  WW1-WW2-DTT  -AA2  -BB2  -RRR  -QQQQQ-AAQQ-               
    HHH  -MMMM-NNNN-LLLL-KKKKK-KKKWW-QQAA-               
    BPPPPPASBL-STA2  -                                                                                                    
    DELIVERY TEXT     STA1.STOP5..AA2.DCT.BB2.RT100.HHH.STOP7.STA2                                                                       
    ROUTE TEXT    STA1 AA2-BB2.RT100-HHH STA2                                                                                         
    REMARKS  TRAFFIC AVOIDANCE 

    )

     

    Not sure if this is what you are looking for. Again, sorry it's in PHP (as I don't know VB.NET), but perhaps my pattern (modified to work in VB.NET) will work well for you?
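Editor's sketch: the same pattern without the PHP delimiters, run through Python's `re` module (chosen here only as a runnable stand-in; it is not the OP's VB.NET). The `re.DOTALL` flag plays the role of the PHP `s` modifier, and `re.findall` with a single group returns the group contents, roughly like the array shown above. The sample text is shortened and hypothetical.

```python
import re

# Shortened stand-in for the route listing (hypothetical data).
SAMPLE = (
    "D001  0950 X R H\n"
    " ROUTE POINTS  STA1  -AAA  -SDR\n"
    " REMARKS\n"
    "\n"
    "D035\n"
    " ROUTE POINTS  WW1-WW2-DTT\n"
    " REMARKS  TRAFFIC AVOIDANCE\n"
)

# .+? lazily consumes up to the first REMARKS; [^\r\n]* then takes the rest
# of that line, which may be empty.
captures = re.findall(r"D\d{3}(.+?REMARKS[^\r\n]*)", SAMPLE, re.DOTALL)

for text in captures:
    print(repr(text))
```

Because the group starts after `D\d{3}`, the route ID itself is not in the captured text; moving the opening parenthesis to the front of the pattern would include it.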

     

    Cheers,

     

    NRG 


    If preg was a woman, I'd marry her. But I could just see the pattern... she would replace me with someone with money ($1).
  •  11-01-2008, 5:24 PM 47852 in reply to 47850

    Re: Find Inner Text Between Key Words (non HTML or XML)

    If you have a way that is working, that is fine, but I think it would help you to understand grouping (see http://regexadvice.com/blogs/mash/archive/2007/06/01/You_2700_ve-got-your-sub_2D00_matches-in-my-matches.aspx). You can simply match the text that includes your boundary words, then pull out just the part you need:

    D\d{3}([\s\S]+?)REMARKS

    matches text that includes your keyword boundaries, but group 1  contains the text you want.
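Editor's sketch of the full match versus group 1, using Python's `re` module as a runnable stand-in for the .NET Regex class (an assumption for illustration; the grouping idea is the same). The sample text is shortened and hypothetical.

```python
import re

# One shortened route block (hypothetical data).
SAMPLE = (
    "D001  0950 X R H\n"
    " ROUTE POINTS  STA1  -AAA\n"
    " REMARKS\n"
)

match = re.search(r"D\d{3}([\s\S]+?)REMARKS", SAMPLE)

whole = match.group(0)  # full match, boundary keywords included
inner = match.group(1)  # only the text between the boundaries

print(repr(whole))
print(repr(inner))
```

The full match is simply the route ID, the inner text, and the closing keyword joined together, so no second pass over the string is needed to isolate the inner part.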


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  11-01-2008, 5:33 PM 47853 in reply to 47852

    Re: Find Inner Text Between Key Words (non HTML or XML)

    mash:

    If you have a way that is working that is fine, but I think it would help you to understand grouping. http://regexadvice.com/blogs/mash/archive/2007/06/01/You_2700_ve-got-your-sub_2D00_matches-in-my-matches.aspx where you can simply match the text that includes your boundary words then just pull out the part you need

    D\d{3}([\s\S]+?)REMARKS

    matches text that includes your keyword boundaries, but group 1  contains the text you want.

    I believe the OP wants to capture up to the end of the REMARKS line, not exclude REMARKS and anything afterwards.

    Also, if I understand your pattern correctly, it only looks for whitespace (any kind) or non-whitespace characters one or more consecutive times (lazily), followed by (but not including) REMARKS. I don't *think* this will capture what the OP is looking for.


  •  11-01-2008, 6:26 PM 47854 in reply to 47853

    Re: Find Inner Text Between Key Words (non HTML or XML)

    nrg_alpha:
    mash:

    If you have a way that is working that is fine, but I think it would help you to understand grouping. http://regexadvice.com/blogs/mash/archive/2007/06/01/You_2700_ve-got-your-sub_2D00_matches-in-my-matches.aspx where you can simply match the text that includes your boundary words then just pull out the part you need

    D\d{3}([\s\S]+?)REMARKS

    matches text that includes your keyword boundaries, but group 1  contains the text you want.

    I believe the OP wants to capture up to the end of the REMARKS line, not exclude REMARKS and anything afterwards.

    Also, if I understand your pattern correctly, it only looks for whitespace (any kind) or non-whitespace characters one or more consecutive times (lazily), followed by (but not including) REMARKS. I don't *think* this will capture what the OP is looking for.

    D\d{3}(([\s\S]+?)REMARKS(.+))


    Michael
  •  11-01-2008, 7:33 PM 47856 in reply to 47854

    Re: Find Inner Text Between Key Words (non HTML or XML)

    mash:

    D\d{3}(([\s\S]+?)REMARKS(.+))

    Why not simply include REMARKS within the capture (and only use that one capture)? 

     D\d{3}([\s\S]+?REMARKS.+)

     What I initially came up with was:

    D\d{3}(.+?REMARKS[^\r\n]*)

    But I also had to use the s modifier (in PHP PCRE) so that the dot matches across newlines as well. I suppose your solution would work, as I am going to assume that the dot wildcard does not match newlines by default in VB.NET? Do you need to add a modifier of some sort to be able to grab all instances between the D\d{3} and the end of REMARKS? Sorry, I'm not familiar with VB.NET.
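Editor's sketch comparing the two endings discussed above, `.+` and `[^\r\n]*`, in Python's `re` module (a runnable stand-in, not the OP's VB.NET). It assumes LF-only line endings and a REMARKS line with nothing after the keyword; both are assumptions about the data, which is shortened and hypothetical here.

```python
import re

# Hypothetical shortened listing: LF line endings, nothing after the first REMARKS.
SAMPLE = (
    "D001  0950 X R H\n"
    " ROUTE TEXT    STA1 AAA.RT44-BBB.RT4-CCC STA2\n"
    " REMARKS\n"
    "\n"
    "D035\n"
    " ROUTE TEXT    STA1 AA2-BB2.RT100-HHH STA2\n"
    " REMARKS  TRAFFIC AVOIDANCE\n"
)

# '.+' needs at least one character after REMARKS on the same line.
with_dot_plus = re.findall(r"D\d{3}([\s\S]+?REMARKS.+)", SAMPLE)

# '[^\r\n]*' also accepts an empty remainder of the line.
with_negated_class = re.findall(r"D\d{3}([\s\S]+?REMARKS[^\r\n]*)", SAMPLE)

print(len(with_dot_plus))       # number of blocks found with '.+'
print(len(with_negated_class))  # number of blocks found with '[^\r\n]*'
```

Under these assumptions the `.+` version cannot finish at the first, empty REMARKS line, so the lazy quantifier keeps expanding and the two route blocks are merged into one match. With CRLF line endings or trailing spaces the `.+` version would find something to consume and behave differently.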


  •  11-01-2008, 8:47 PM 47858 in reply to 47856

    Re: Find Inner Text Between Key Words (non HTML or XML)

    You could use one capture, but from the OP's question I get the impression they want to dissect the results. First they ask for the text between the keywords, then for the text to the end of the REMARKS line. The main thing I was stressing in my first post was to learn how to use grouping if that was the case, so I created the groups I thought they might want. A lot of new users assume that once they match a string they can only work with the full match, so they process the results even more to get what they want because they are not familiar with grouping. If it was just about grabbing the text based on the sample, you could do it without any groups.

    The full stop character by default does not match a newline in almost all regex engines. I don't know an engine off the top of my head where matching newlines is the default behavior. Some, but not all, languages have an option to change this. I chose not to use this option because I wanted the default behavior in part of the pattern, and I used the [\s\S] construct where I wanted to match any character including newlines. You've simply taken the opposite approach: using the dotall option of PHP to have the full stop match newlines, then using a character class to match everything but newline or carriage return. BTW, why check for both? Either should do, but the newline is really the only one needed. The only other difference is that your method requires an option to be flipped on and mine doesn't.
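Editor's sketch of the two approaches described above, using Python's `re` module as a runnable stand-in (the .NET counterpart of the flag is `RegexOptions.Singleline`). The two-line test string is made up for illustration.

```python
import re

text = "first line\nsecond line"

# Default behavior: '.' stops at the newline, so the pattern cannot reach "second".
default_dot = re.search(r"first.+second", text)

# Option flipped on: '.' now matches newlines as well.
dotall = re.search(r"first.+second", text, re.DOTALL)

# No option needed: [\s\S] matches any character, newline included.
char_class = re.search(r"first[\s\S]+second", text)

print(default_dot)          # no match object
print(dotall.group(0))
print(char_class.group(0))
```

The character-class form lets a single pattern mix both behaviors: `[\s\S]` where newlines should be crossed and a plain `.` where the match should stay on one line.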

    As for .NET, the Regex object has different methods you can call; one will return all matches if that's what you want. Doesn't PHP have the same thing? I'm not a PHP guy, not yet anyway.

     


    Michael
  •  11-01-2008, 9:12 PM 47859 in reply to 47848

    Re: Find Inner Text Between Key Words (non HTML or XML)

    sindizzy:

    I am attempting to find inner text in between two key words for a text file that is not HTML or XML. The text file would have text like so:

    One more thing I would like to add: non-HTML, non-XML files are the best kind of text files to use a regex against. Depending on what you are trying to do, markup language files, HTML in particular, can become problematic to work with using regular expressions. In most cases there are better tools for working with markup data.


    Michael
  •  11-01-2008, 10:58 PM 47862 in reply to 47858

    Re: Find Inner Text Between Key Words (non HTML or XML)

    Thanks for filling me in on all that, Mash. If the OP is looking for segmented captures, then yes, multiple captures would be best.

    mash:

    As for .Net the Regex object has different methods you can call, one will return all matches if that's what you want. Doesn't PHP have the same thing?

    If I understand what you are asking correctly, PHP offers different matching functions. preg_match() matches only the first instance it finds, then stops. preg_match_all() resumes from the end of each successful match and keeps searching, so a single pattern can collect as many matches as the string contains and store them in an array (which is what I used in this thread's case). As a result, I had two elements in my array given the search pattern I provided.

     Cheers,

    NRG 


  •  11-01-2008, 11:16 PM 47863 in reply to 47862

    Re: Find Inner Text Between Key Words (non HTML or XML)

    Yeah .Net is like that too

    There is Regex.Match method and a Regex.Matches.  The former returns a single match, the latter returns a collection of matches.
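Editor's sketch of the single-match versus all-matches distinction, in Python's `re` module as a runnable analogue: `search` plays the role of Regex.Match (or preg_match) and `finditer` the role of Regex.Matches (or preg_match_all). The one-line records are hypothetical placeholders for the real route blocks.

```python
import re

# Two tiny placeholder records (hypothetical data).
SAMPLE = "D001 ... REMARKS\nD035 ... REMARKS  TRAFFIC AVOIDANCE\n"
PATTERN = re.compile(r"D\d{3}[\s\S]+?REMARKS[^\r\n]*")

# Single match: stops after the first hit.
first = PATTERN.search(SAMPLE)

# All matches: each search resumes where the previous match ended.
all_matches = [m.group(0) for m in PATTERN.finditer(SAMPLE)]

print(first.group(0))
print(all_matches)
```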


    Michael
  •  11-02-2008, 10:18 AM 47870 in reply to 47863

    Re: Find Inner Text Between Key Words (non HTML or XML)

    Let me test some of the suggestions noted here. To answer some of the questions: yes, eventually I want to parse some info out of each match, but I am familiar with the grouping part and am able to work with it no problem. It's the base regexp that I was having problems with. Right now the most important parts are the route ID and maybe the remarks, but later I may want to grab the other data. Thanks for the help; I will follow up with my testing.

    AGP

     

  •  11-02-2008, 10:53 AM 47871 in reply to 47870

    Re: Find Inner Text Between Key Words (non HTML or XML)

    OK, I tried (?s)D\d{3}(?:(?!REMARKS).)*REMARKS and it has the problems I mentioned. It does not include the words after REMARKS, and it also falsely captures some points that are constructed like the route ID, for example if there was a point D300P-D100Q-RUGGED in the route structure.

    This does not work in VB.NET: #D\d{3}(.+?REMARKS[^\r\n]*)#s. I don't think the # is supported.

    D\d{3}([\s\S]+?)REMARKS also does not include the end of the remarks, and it captures the points that are similar to the route ID.

    D\d{3}(([\s\S]+?)REMARKS(.+)) captures up until the end of the remarks but captures points that are similar to the route ID.

    D\d{3}([\s\S]+?REMARKS.+) doesn't work very well. It doesn't capture the entire block, but I think that's because of the points that are similar to the route ID.

    To my original example let me add a point that gives a "false positive"

    D001  0950 X R H                                              
     ROUTE POINTS  STA1  -AAA  -SDR  -AAA -AAA11-BBB  -DDD2-
                   DDD3-CCC-DDEE-RRRR-WWWXX-YYYY-ZZ023-ZZZZZ-
                   STA2  -                                         
                                                                  
     DELIVERY TEXT     SAT1.STOP1.AAA.RT44.BBB.RT4.CCC.STOPX.STA2
                                                                  
     ROUTE TEXT    STA1 AAA.RT44-BBB.RT4-CCC STA2                     
                                                                  
     REMARKS 

    D035                                                
     ROUTE POINTS  WW1-WW2-DTT  -AA2  -BB2  -RRR  -QQQQQ-AAQQ-
                   HHH  -MMMM-NNNN-LLLL-KKKKK-KKKWW-QQAA-
                   D300P-D100Q-RUGGED-BPPPPPASBL-STA2  -                                   
                                                                  
     DELIVERY TEXT     STA1.STOP5..AA2.DCT.BB2.RT100.HHH.STOP7.STA2      
                                                                  
     ROUTE TEXT    STA1 AA2-BB2.RT100-HHH STA2                        
                                                                  
     REMARKS  TRAFFIC AVOIDANCE

     The regexp I have currently D\d{3}[A-Z]{0,1}(.|\s)*?REMARKS(.|\s)*?(?=\r) seems to work but is relatively slow and sometimes crashes in my project. I test in Expresso first and it runs but in my project it takes a long time and sometimes will make the app crash.

    thanks for all the help.

     

    AGP

     

  •  11-02-2008, 12:49 PM 47874 in reply to 47871

    Re: Find Inner Text Between Key Words (non HTML or XML)

    sindizzy:

    this does not work in VB.NET  #D\d{3}(.+?REMARKS[^\r\n]*)#s. I dont think the # is supported.

    As I noted in one of my responses, the # characters are one of many possible delimiters that PHP's Perl-compatible regular expression functions require. You don't use those in VB.NET.

    Examining your pattern, you seem to be over complicating things:

    D\d{3}[A-Z]{0,1}(.|\s)*?REMARKS(.|\s)*?(?=\r)

    a) I'm not sure why you have a character class of A-Z with 0-1 occurrences when you can use a lazy quantifier to match anything up to REMARKS (.+?). Much easier this way.

    b) Try to avoid using alternations; they are slower than character classes. Side note: your (.|\s) is not useful as written, since \s is already covered by the dot when dotall matching is on.

    So you have (.|\s)*?(?=\r), which is basically saying: anything or any whitespace (zero or more times, lazily), followed by a lookahead assertion for \r.

    What I have done is pretty much this, only I had \n as well. To revise it without \n: a negated character class [^\r]* is much more efficient than making the regex engine jump through hoops with an inefficient alternation and a lookahead assertion.

    Therefore, perhaps this would work?

    D\d{3}(.+?REMARKS[^\r]*) 

     

    Cheers,

    NRG 

     


  •  11-02-2008, 5:20 PM 47879 in reply to 47871

    Re: Find Inner Text Between Key Words (non HTML or XML)

    sindizzy:

    this does not work in VB.NET  #D\d{3}(.+?REMARKS[^\r\n]*)#s. I dont think the # is supported. 

    sindizzy 

    This is one of the reasons why I tend to post my examples as the raw pattern or replacement text, without quotes etc. It is also one of the reasons why the posting guidelines ask about the regex flavor and programming language that you are using (and I know the OP stated the use of VB.NET).

    Finally, it is why the posters to this forum CANNOT simply take what is suggested WITHOUT THINKING.

    The suggested pattern above clearly carried the rider that it was PHP code; therefore the '#'s are being used as pattern delimiters. If you didn't understand the pattern that was being suggested, then you should have asked for an explanation or at least indicated where you were getting lost.

    You are right: the '#' is NOT part of the VB.NET syntax. But if you really do understand VB.NET and have looked at the Microsoft documentation for the functions that you are using, you would have realised that the functions expect a string, and VB.NET does not have a string form with leading and trailing '#'s. Therefore you will need to do some work for yourself and convert this into a form that is acceptable to the VB.NET functions.

    The regexp I have currently D\d{3}[A-Z]{0,1}(.|\s)*?REMARKS(.|\s)*?(?=\r) seems to work but is relatively slow and sometimes crashes in my project. I test in Expresso first and it runs but in my project it takes a long time and sometimes will make the app crash.

    I must admit that I have not investigated this too much, but I suspect the problem is in the '(.|\s)*?(?=\r)' part. What this will do is match any character (by the way, using either the '.' with the 'singleline' option or the [\S\s] character set will do the same thing as '(.|\s)') and then look forward for a carriage-return character. If the 'REMARKS' keyword is at the end of a line, then it will be immediately followed by a line terminator (and as this is a Windows platform it may well be a '\r\n' pair, unless this has been converted into '\n' by something else). I'm guessing that the regex engine will try to match the '(.|\s)' subpattern, which will succeed on the '\r' character. Because you have made this a lazy quantifier, it will then check for the lookahead, which will fail. It will then go back to the '(.|\s)' part and try again, matching the '\n'. It will keep doing this until it finally does match a character that is immediately followed by a '\r'. However, in doing so it will have left behind many saved states so that it could backtrack, and this may cause your program to crash when it runs out of stack space.

    On the other hand, if it never finds the '\r', then the regex engine will start to backtrack, one character at a time, until it has removed ALL of the matches it has made - only then will it realise that a match count of 0 is what is needed in this case and so will declare the final 'success'. However, this will take a long time.

    Finally, the catastrophic part may well be the interplay between the 2 instances of this subpattern on either side of the 'REMARKS' keyword. This can lead to an exponential explosion of combinations that the regex engine must check.

    I would suggest something like (untested):

    ((?!REMARKS)[\S\s])*REMARKS[^\r]*

    as an alternative - this should never need to backtrack as each step forward is deterministic. 
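Editor's sketch of the suggested pattern, run in Python's `re` module as a stand-in for VB.NET. Two things here are assumptions and not part of the suggestion as posted: the `D\d{3}` prefix added so that each match starts at a route ID, and the shortened, hypothetical sample with Windows-style `\r\n` line endings.

```python
import re

# Hypothetical shortened listing with CRLF line endings (an assumption).
SAMPLE = (
    "D001  0950 X R H\r\n"
    " ROUTE POINTS  STA1  -AAA\r\n"
    " REMARKS\r\n"
    "\r\n"
    "D035\r\n"
    " ROUTE POINTS  WW1-WW2-DTT\r\n"
    " REMARKS  TRAFFIC AVOIDANCE\r\n"
)

# The D\d{3} prefix is the editor's addition; the rest is the posted pattern.
# Each character is consumed only if REMARKS does not start there, then
# [^\r]* takes the remainder of the REMARKS line up to the carriage return.
PATTERN = re.compile(r"D\d{3}((?!REMARKS)[\S\s])*REMARKS[^\r]*")

blocks = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
for block in blocks:
    print(repr(block))
```

One caveat on the design: `[^\r]*` only stops at a carriage return, so on a file with LF-only line endings it would keep consuming past the end of the REMARKS line. `[^\r\n]*` covers both cases.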

    I bet that when you are using Expresso you are using a small test text but in your program you have a much longer text to scan. You need to look at the timing information that Expresso is providing you and also provide positive, negative and borderline test cases.

    Susan 
