
Repeated chunk of characters

Last post 08-29-2008, 9:45 AM by ddrudik. 20 replies.
  •  08-25-2008, 10:54 PM 45630

    Repeated chunk of characters

    I'm programming in VB.NET 2005 and I have a huge string of characters from which I want to extract certain info. The string may contain repeated chunks of data, but I'm only interested in a very specific part. For example, the data could contain the following:

     251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    The important data that I would like the regexp to return is

    08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    and there may be more than one of these matches in the string. Right now my regexp returns matches that I do not need; I almost have it, but not quite. The data could have intermediate text, like

     08/027 ZZZ blah blah blah DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    but the other parts of the string like

    08/380 ZZZ THIS IS WORTHLESS TEXT and 08/381 ZZZ NOT WORTH ANYTHING

    are throwing it for a loop.

    Here is my current regexp:

    "([0-9]{2}/[0-9]{3})\s(\w{3})(.*?)(DATA CRITICAL (([0-9]+)) NM RADIUS AT ([0-9]{5,6}[NS])/([0-9]{5,7}[EW]))"

     Any help is appreciated.

    AGP

     

  •  08-25-2008, 11:36 PM 45631 in reply to 45630

    Re: Repeated chunk of characters

    You are going to have to give us some way of differentiating "ZZZ blah blah blah DATA CRITICAL" from "ZZZ DATA CRITICAL". If the ZZZ is real data then match it exactly. If there are restrictions on what can be between the "ZZZ" and the "DATA CRITICAL" then we will need to know them: only alphas, no special characters, specific phrases, etc.

    By the way, your pattern as it stands does not match anything, but if you change the "NM RADIUS" to "M RADIUS" then it does.
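
    For anyone who wants to check this without Expresso, here is a minimal sketch in Python rather than VB.NET (an assumption on my part: Python's re module treats these particular constructs the same way .NET does). It runs the pattern as posted and then the same pattern with "NM RADIUS" changed to "M RADIUS":

```python
import re

SAMPLE = (
    "251824 08/380 ZZZ THIS IS WORTHLESS TEXT\n"
    " 0808251645-0808251830\n"
    "251824 08/381 ZZZ NOT WORTH ANYTHING\n"
    " 0808251645-0808251830\n"
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W\n"
)

# The pattern exactly as posted, including "NM RADIUS".
POSTED = (r"([0-9]{2}/[0-9]{3})\s(\w{3})(.*?)"
          r"(DATA CRITICAL (([0-9]+)) NM RADIUS AT "
          r"([0-9]{5,6}[NS])/([0-9]{5,7}[EW]))")

# The same pattern with the one-word change: "M RADIUS".
FIXED = POSTED.replace(" NM RADIUS", " M RADIUS")

posted_match = re.search(POSTED, SAMPLE)
fixed_match = re.search(FIXED, SAMPLE)

print(posted_match)                                 # no match object
print(fixed_match.group(0))                         # the whole matched record
print(fixed_match.group(1), fixed_match.group(2))   # message number, identifier
```

    With no options set, '.' does not cross line breaks, so the two worthless records cannot reach the DATA CRITICAL text and are skipped.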

    Susan

  •  08-26-2008, 12:52 PM 45660 in reply to 45631

    Re: Repeated chunk of characters

    The ZZZ can be any three alpha characters, so it can be ERT, DFR, TYO, etc. Between that and the DATA CRITICAL there can be some other words or phrases, but the meat of the message is always

    08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W where I am assured the following

    ##/### AAA <...possibly some words..>  DATA CRITICAL ### M RADIUS AT ######[N or S]/#######[W or E]

    Sorry about the typo, but the original regexp does have the M RADIUS. I'm trying to avoid matching other areas of the message where I see ##/### AAA <...text not including the above..>

    So out of the message

     251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W 

    I want to match any occurrences of this type of line:

     08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W 

    Thanks

     AGP

     

  •  08-26-2008, 2:28 PM 45663 in reply to 45660

    Re: Repeated chunk of characters

    \d\d/\d{3} [A-Z]{3}\b.*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]
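
    To make the pieces of that pattern easier to follow, here is the same expression laid out with comments. This is only a sketch in Python (not the VB.NET used in this thread); it assumes Python's re.VERBOSE mode, where unescaped whitespace is ignored, so each literal space is written as [ ]. The test lines are taken from the posts above.

```python
import re

RECORD = re.compile(r"""
    \d\d/\d{3}                   # message number, e.g. 08/027
    [ ][A-Z]{3}\b                # one space, then a three-letter identifier
    .*?                          # as little filler as possible (stays on one line)
    \bDATA[ ]CRITICAL[ ]         # the key phrase
    \d{3}[ ]M[ ]RADIUS[ ]AT[ ]   # three-digit radius
    \d{6}[NS]/\d{7}[WE]          # six-digit latitude, seven-digit longitude
""", re.VERBOSE)

lines = [
    "251824 08/380 ZZZ THIS IS WORTHLESS TEXT",
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W",
    "181225 08/027 ZZZ blah blah blah DATA CRITICAL 305 M RADIUS AT 372419N/1154323W",
]

# Collect the matched text from every line that contains a critical record.
hits = [m.group(0) for line in lines for m in RECORD.finditer(line)]
for hit in hits:
    print(hit)
```

    Note that the digit counts are strict: a radius with fewer than three digits, or a latitude with seven digits, would not match as written.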
  •  08-26-2008, 6:41 PM 45668 in reply to 45663

    Re: Repeated chunk of characters

    Yes, that is a variation of what I had, and it works for the most part, except that as its first match it picks up

     251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W  

    In other words, the entire string, probably because of the initial "08/380 ZZZ THIS". The only string I wanted to match was the last line (it was in bold). But again I have the challenge that there could be some other words between ZZZ and DATA CRITICAL 305 M RADIUS AT 372419N/1154323W. So in other words I want to match

    08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W 

    08/035 ZAA RESPOND OTS DATA CRITICAL 25 M RADIUS AT 3725500N/1152350W  

    07/044 ZBB INDETERMINATE USER NO RESPONSE DATA CRITICAL 9 M RADIUS AT 3725500N/1152350W   

    etc

    but strings like

    08/380 ZZZ THIS IS WORTHLESS TEXT 

    08/381 ZZZ NOT WORTH ANYTHING

    do not need to be parsed and are essentially worthless to the analysis. As it is right now, when I parse the message the first match I get back is the entire message. Sometimes that is the only match I get. I'm going to try some variations and report back what works. Any help is still appreciated.

     

    AGP

     

  •  08-26-2008, 7:17 PM 45669 in reply to 45668

    Re: Repeated chunk of characters

    I would recommend you turn off singleline if that is an option for your platform:

    Raw Match Pattern:
    \d\d/\d{3} [A-Z]{3}\b.*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]

    $matches Array:
    (
        [0] => Array
            (
                [0] => 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W
            )

    )

    you could also try to disable it in the pattern:

    (?-s)\d\d/\d{3} [A-Z]{3}\b.*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]
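
    Here is a small sketch of the difference, written in Python instead of .NET. The assumption is that Python's re.DOTALL flag plays the role of the 'singleline' option (both let '.' match a line break):

```python
import re

PATTERN = (r"\d\d/\d{3} [A-Z]{3}\b.*?"
           r"\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]")

SAMPLE = (
    "251824 08/380 ZZZ THIS IS WORTHLESS TEXT\n"
    " 0808251645-0808251830\n"
    "251824 08/381 ZZZ NOT WORTH ANYTHING\n"
    " 0808251645-0808251830\n"
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W\n"
)

# Default: '.' stops at a line break, so only the last line can match.
default_hits = re.findall(PATTERN, SAMPLE)

# DOTALL: '.' also matches line breaks, so the match starts at the first
# "08/380 ZZZ" and runs across every line up to the coordinates.
singleline_hits = re.findall(PATTERN, SAMPLE, flags=re.DOTALL)

print(default_hits)
print(singleline_hits)
```

    In the second case there is still only one match, but it swallows the worthless records in front of the one you want.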


  •  08-26-2008, 7:21 PM 45670 in reply to 45668

    Re: Repeated chunk of characters

    Do you by any chance have the 'singleline' option set? DON'T. ddrudik's pattern works exactly as you have asked UNLESS the 'singleline' option is set, in which case you will get the behaviour you describe.

    For my testing I used 'Expresso', which is .NET based (the same as you say you are using), with no options set (although you may want 'ignore case' if appropriate).

    Susan

    Edit: Doubled with Doug

  •  08-26-2008, 7:40 PM 45671 in reply to 45670

    Re: Repeated chunk of characters

    I suppose another option is:

    \d\d/\d{3} [A-Z]{3}\b[^\n]*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]
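
    A quick sketch of why this variant is safer, in Python rather than .NET (assuming re.DOTALL corresponds to the 'singleline' option): because [^\n] can never match a line feed, the result is the same even with that option switched on.

```python
import re

NO_NEWLINE = (r"\d\d/\d{3} [A-Z]{3}\b[^\n]*?"
              r"\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]")

SAMPLE = (
    "251824 08/380 ZZZ THIS IS WORTHLESS TEXT\n"
    " 0808251645-0808251830\n"
    "251824 08/381 ZZZ NOT WORTH ANYTHING\n"
    " 0808251645-0808251830\n"
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W\n"
)

# Even with DOTALL set, the filler cannot run past the end of a line.
hits = re.findall(NO_NEWLINE, SAMPLE, flags=re.DOTALL)
print(hits)
```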


  •  08-26-2008, 11:45 PM 45680 in reply to 45671

    Re: Repeated chunk of characters

    I have no options set on my regexp. And when I download Expresso and put the regexp and text in the boxes, I get zero matches:

    \d\d/\d{3} [A-Z]{3}\b[^\n]*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]

      251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    I must be doing something wrong. Do you have an Expresso project that I can use to learn from? I'm at a bit of a loss here. I will experiment with Expresso and see if it can help me find my error.

    <edit: never mind... I realized that since my regexp had no options set, I should also set the same in Design Mode. Anyway, I will continue my testing.>

     

    AGP

     

     

     

  •  08-27-2008, 12:36 AM 45684 in reply to 45680

    Re: Repeated chunk of characters

    OK, I think I found one of my problems. Since the data I receive could run over onto another line, I clean the message into one long string in order to get all possible matches. So instead of the message being

       251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    it's actually

     251824 08/380 ZZZ THIS IS WORTHLESS TEXT 0808251645-0808251830 251824 08/381 ZZZ NOT WORTH ANYTHING 0808251645-0808251830 181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W 

    Now if I use that regexp then it returns the only match as

    08/380 ZZZ THIS IS WORTHLESS TEXT 0808251645-0808251830 251824 08/381 ZZZ NOT WORTH ANYTHING 0808251645-0808251830 181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    So here is my question: am I doing the right thing by cleaning the string first (taking out unreadable characters and line feeds, and collapsing multiple spaces into one)? Or is there a better way to do it directly with the untouched data? Sorry about the confusion; I forgot that in between reading the message and sending it to the parser I did the cleaning. The message could come across as

    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS
    AT 372419N/1154323W

    or

    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS  
      AT 372419N/1154323W

    The extra space at the end and the beginning of the line is either accidentally put there or is added by the system before it gets sent to me. That is why I clean the string and make it one big string, so that I am assured that all multiple spaces are truncated to one and all other white space is replaced with a null string. I hope that makes more sense.
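
    Roughly, the cleaning step looks like the sketch below. It is written in Python rather than my VB.NET code, clean_message is a made-up name for illustration, and it simplifies things by turning every run of whitespace (line feeds included) into a single space so that wrapped words do not run together:

```python
import re

def clean_message(raw: str) -> str:
    """Hypothetical helper: flatten a raw message into one single-spaced line."""
    # Drop characters that are neither printable nor whitespace (control codes).
    readable = "".join(ch for ch in raw if ch.isprintable() or ch.isspace())
    # Collapse every run of spaces, tabs and line breaks into one space.
    return re.sub(r"\s+", " ", readable).strip()

wrapped = "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS  \n  AT 372419N/1154323W\n"
flat = clean_message(wrapped)
print(flat)
```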

    btw, thanks for suggesting Expresso. It's saving me tons of time with testing.

    AGP

  •  08-27-2008, 3:03 AM 45685 in reply to 45684

    Re: Repeated chunk of characters

    The basic answer to your question really depends on your understanding of the raw data, what you are trying to do, and which approach makes the task the most reliable.

    However, given the fact that the text could be a single line (or could be treated as such), try:

    \d\d/\d{3} [A-Z]{3}\b((?!\d\d/\d{3} [A-Z]{3}\b).)*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]

    Basically this is the same as the previous suggestion, except that the '.*?' has been replaced by '((?!\d\d/\d{3} [A-Z]{3}\b).)*?', which works its way forward one character at a time and, before each step, tests whether the start of the pattern can be seen again. If it can, this attempt fails and matching restarts from further down the text string. This should mean that the 'DATA CRITICAL' text is matched with the ZZZ (or whatever) stuff that is closest to it.
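
    To see the effect on the flattened text, here is a minimal sketch in Python (not .NET; I am assuming the lookahead behaves the same in both engines). HEAD and TAIL are just names chosen here for the two ends of the pattern:

```python
import re

FLAT = ("251824 08/380 ZZZ THIS IS WORTHLESS TEXT 0808251645-0808251830 "
        "251824 08/381 ZZZ NOT WORTH ANYTHING 0808251645-0808251830 "
        "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W")

HEAD = r"\d\d/\d{3} [A-Z]{3}\b"
TAIL = r"\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]"

# Plain lazy filler: free to run across the start of later records.
LAZY = HEAD + r".*?" + TAIL

# Guarded filler: each character is consumed only if a new record
# (HEAD) does not start at that position.
GUARDED = HEAD + r"((?!" + HEAD + r").)*?" + TAIL

lazy_hits = [m.group(0) for m in re.finditer(LAZY, FLAT)]
guarded_hits = [m.group(0) for m in re.finditer(GUARDED, FLAT)]

print(lazy_hits)      # one oversized match starting at 08/380
print(guarded_hits)   # only the 08/027 record
```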

    If I were you, I would step back from the computer, grab a piece of paper and work out the fundamental aspects of the text you are looking for. Call them 'beginning', 'stuff to skip', 'key phrase' or whatever, so that your thinking can be simplified and not tied up with the details of each component. You can then 'think like a computer' and step your way through the text (of whatever form), see what decisions need to be made at each step, and so develop the set of 'rules' that means you make the matches you want and not the ones you don't. Once you have the basic structure, you can go back and fill in the details (exactly what goes to make up the 'beginning' phrase, etc.).

    In effect this means that you need to create a state diagram of what you are trying to do, do all of your thinking with that (also by testing with situations that are close to being right but are actually wrong, and vice versa - borderline cases are typically the best/worst tests) and then create the pattern.

    Susan

     

  •  08-27-2008, 11:42 AM 45702 in reply to 45685

    Re: Repeated chunk of characters

    Susan

    Your guidance is greatly appreciated. I came in this morning and tried your suggestion with a couple of modifications; it worked, and it is exactly what I needed to finish the regexp.

    In answer to your comments: I am very familiar with the data. Last year I parsed similar data and came to the conclusion that without cleaning the string first I would miss many matches. I did in fact start with a piece of paper and wrote down the data that I wanted to parse out. I worked my way from the simple parts towards the more difficult, which in this case seemed to be the start of the string. That's where I got stuck. But the forum led me to better understand what I was doing, and now with the Expresso tool I am even more enlightened. Once I tested your suggestion I went further and grouped many of the pieces that I needed and added some more phrases (easier stuff), again using Expresso. For example

    DATA CRITICAL (?<radius>\d{3}) M RADIUS

    It works beautifully, and now I am just going to take some time to fully understand the structure of your suggested regexp.
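
    For anyone reading along outside .NET, here is a small sketch of the same named group in Python. The one difference I am sure of is the spelling: .NET accepts (?<radius>...) while Python's re module needs (?P<radius>...).

```python
import re

# Python spelling of the named group from the .NET pattern above.
RADIUS = re.compile(r"DATA CRITICAL (?P<radius>\d{3}) M RADIUS")

line = "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W"
m = RADIUS.search(line)

# The group is read back by name instead of by position.
radius = int(m.group("radius"))
print(radius)
```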

    Again, you and the forum have been extremely helpful and I couldn't have progressed without the help.

    AGP

  •  08-28-2008, 11:33 AM 45751 in reply to 45702

    Re: Repeated chunk of characters

    As luck would have it, I put the regexp through a real-world test and ran it through a large number of messages. In my process I highlight small phrases in red in a rich textbox so that I know where to look for my match. Then in my actual matching routine I use a more rigorous regexp to get the full data string. Like I said, it works great, but I found out that there may be another format for the message, and that's where I'm at right now. The message could have another data line that is valid but looks slightly different:

     280623 08/634 ZZZ AREA GLADTRY 1 MTW 5000-17999 EFF
     0808280200-0808280650
    280623 08/635 ZZZ AREA BAGTY 4 MTQ 5000-17999 EFF
     0808280200-0808280650
    271815

    181225 08/034 PZZZ EFF 0808192100 - 0808249899
     2100 - 0900 EVERY DAY
     MSG SENT DATA CRITICAL
     295 M RADIUS AT 330107N/1061726W
    AT 400, DECRBY 200

        251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    I experimented with several variations, but here is what I am using as a first stab:

    \d\d/\d{3} [A-Z]{3,4} (.{0,30})\b.*?\bDATA CRITICAL \d{3} M RADIUS AT \d{6}[NS]/\d{7}[WE]

    The .{0,30} is just a first pass to see if I could pick up some intermediate phrases. The main thing is that, depending on the source, the main identifier could have 3 or 4 characters, and thus I changed it to [A-Z]{3,4}. But I'm getting some false positives with other lines. I think I'm close, but I'm still having some issues with the regexp. Again, the help from this forum is invaluable and much appreciated.

     

  •  08-28-2008, 1:12 PM 45752 in reply to 45751

    Re: Repeated chunk of characters

    This fits your sample, if there's an issue provide a larger sample:

    \d\d/\d{3}\s+[A-Z]{3,4}\b(?:(?!\d\d/\d{3}).)*?\s+DATA\s+CRITICAL\s+\d{3}\s+M\s+RADIUS\s+AT\s+\d{6}[NS]/\d{7}[WE]

    Note the \s+ in case the string word wraps in different areas.
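
    Here is a sketch of that pattern run against the wrapped sample, in Python rather than .NET. One assumption to flag: since the PZZZ record spans several lines, '.' has to be allowed to cross line breaks, so the sketch sets re.DOTALL (the counterpart of 'singleline'); running it on text that has already been flattened to one line should give the same records.

```python
import re

PATTERN = (r"\d\d/\d{3}\s+[A-Z]{3,4}\b(?:(?!\d\d/\d{3}).)*?"
           r"\s+DATA\s+CRITICAL\s+\d{3}\s+M\s+RADIUS\s+AT\s+\d{6}[NS]/\d{7}[WE]")

SAMPLE = "\n".join([
    "280623 08/634 ZZZ AREA GLADTRY 1 MTW 5000-17999 EFF",
    " 0808280200-0808280650",
    "280623 08/635 ZZZ AREA BAGTY 4 MTQ 5000-17999 EFF",
    " 0808280200-0808280650",
    "271815",
    "",
    "181225 08/034 PZZZ EFF 0808192100 - 0808249899",
    " 2100 - 0900 EVERY DAY",
    " MSG SENT DATA CRITICAL",
    " 295 M RADIUS AT 330107N/1061726W",
    "AT 400, DECRBY 200",
    "",
    "251824 08/380 ZZZ THIS IS WORTHLESS TEXT",
    " 0808251645-0808251830",
    "251824 08/381 ZZZ NOT WORTH ANYTHING",
    " 0808251645-0808251830",
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W",
])

# With DOTALL the filler may span wrapped lines, but the lookahead still
# stops it from running into the next message number.
wrapped_hits = re.findall(PATTERN, SAMPLE, flags=re.DOTALL)

# Squeeze each hit onto one line purely for display.
flat_hits = [" ".join(hit.split()) for hit in wrapped_hits]
for hit in flat_hits:
    print(hit)
```

    Without the flag, only the single-line 08/027 record is found in this sample.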


  •  08-28-2008, 2:51 PM 45760 in reply to 45752

    Re: Repeated chunk of characters

    Yeah, I think that is working for me. There is another possibility, for which I think I made the right modifications:

     280623 08/634 ZZZ AREA GLADTRY 1 MTW 5000-17999 EFF
     0808280200-0808280650
    280623 08/635 ZZZ AREA BAGTY 4 MTQ 5000-17999 EFF
     0808280200-0808280650
    271815

    181225 08/031 KFFTAXXX PZZW EFF 0808192100 - 0808249899
    MSG SENT DATA CRITICAL
    302 M RADIUS AT 330107N/1061726W
    AT 200, USERID2

    181225 08/034 PZZZ EFF 0808192100 - 0808249899
     2100 - 0900 EVERY DAY
     MSG SENT DATA CRITICAL
     295 M RADIUS AT 330107N/1061726W
    AT 400, DECRBY 200

        251824 08/380 ZZZ THIS IS WORTHLESS TEXT
     0808251645-0808251830
    251824 08/381 ZZZ NOT WORTH ANYTHING
     0808251645-0808251830
    181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W

    On the first one there could be an additional 8-character word (as far as I know right now) before the identifier. I use this instead:

     \d\d/\d{3}\s+(?:(?!\d\d/\d{3}).)*?\s+[A-Z]{3,4}\b(?:(?!\d\d/\d{3}).)*?\s+DATA\s+CRITICAL\s+\d{3}\s+M\s+RADIUS\s+AT\s+\d{6}[NS]/\d{7}[WE]

    I tested it and it gives me what I need, but I have a feeling that I'm not doing that 8-character part right. I assumed that since this was the way to skip unwanted text between the 3 or 4 letter identifier and the DATA CRITICAL, it would probably also work for the irrelevant 8-character word. But now that I'm testing it I think it is wrong, because when I try to group the 3 or 4 letter identifier I get some erroneous identifiers. In my case the identifiers would be PZZW, PZZZ, and ZZZ. In other words, right after the message number (like 08/027), the first 3 or 4 letter word that is encountered is the identifier. I'll work on this some more, but thanks to the forum I'm pretty close.
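
    Here is a sketch of what I mean, written in Python rather than VB.NET and run with re.DOTALL because the records wrap (both are assumptions; .NET may differ in detail). POSTED is the pattern above with a group added around the identifier. EXPLICIT is only one possible variation, not a tested fix: it replaces the first skip with an optional word of five or more capital letters, so the identifier has to be the first 3 or 4 letter word.

```python
import re

TAIL = (r"(?:(?!\d\d/\d{3}).)*?\s+DATA\s+CRITICAL\s+\d{3}\s+M\s+RADIUS"
        r"\s+AT\s+\d{6}[NS]/\d{7}[WE]")

# The posted pattern, with a capturing group around the identifier.
POSTED = r"\d\d/\d{3}\s+(?:(?!\d\d/\d{3}).)*?\s+([A-Z]{3,4})\b" + TAIL

# Hypothetical variation: an optional longer word, then the identifier.
EXPLICIT = r"\d\d/\d{3}\s+(?:[A-Z]{5,}\s+)?([A-Z]{3,4})\b" + TAIL

SAMPLE = "\n".join([
    "280623 08/635 ZZZ AREA BAGTY 4 MTQ 5000-17999 EFF",
    " 0808280200-0808280650",
    "271815",
    "",
    "181225 08/031 KFFTAXXX PZZW EFF 0808192100 - 0808249899",
    "MSG SENT DATA CRITICAL",
    "302 M RADIUS AT 330107N/1061726W",
    "AT 200, USERID2",
    "",
    "181225 08/034 PZZZ EFF 0808192100 - 0808249899",
    " 2100 - 0900 EVERY DAY",
    " MSG SENT DATA CRITICAL",
    " 295 M RADIUS AT 330107N/1061726W",
    "AT 400, DECRBY 200",
    "",
    "251824 08/381 ZZZ NOT WORTH ANYTHING",
    " 0808251645-0808251830",
    "181225 08/027 ZZZ DATA CRITICAL 305 M RADIUS AT 372419N/1154323W",
])

# findall returns the captured identifier of each matching record.
posted_ids = re.findall(POSTED, SAMPLE, flags=re.DOTALL)
explicit_ids = re.findall(EXPLICIT, SAMPLE, flags=re.DOTALL)

print(posted_ids)
print(explicit_ids)
```

    In this run the posted pattern always skips the first word after the message number, so it reports EFF for the 08/034 record and finds no identifier at all for 08/027.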

    AGP

     

     
