
need help finding regexp

Last post 02-08-2012, 6:52 PM by Aussie Susan. 3 replies.
  •  02-06-2012, 12:59 PM 84611

    need help finding regexp

    Hi,

    here is the text that I have, it's a chat log file, I replaced real texts that people usually type in a chat with "random text #":

    2012-02-06 12:20:53 [Team] [me] ! random text 1
    2012-02-06 12:21:37 [Local] [player3] random text 2
    2012-02-06 12:21:52 [Team] [me] ! random text 3
    2012-02-06 12:21:56 [Local] [] Your position is: (121, 911)
    2012-02-06 12:22:33 [Local] [] random text 4
    2012-02-06 12:22:33 [Local] [] Your position is: (471, 901)
    2012-02-06 12:22:36 [Team] [me] random text 5
    2012-02-06 12:23:14 [Local] [] random text 6
    2012-02-06 12:23:57 [Local] [player12] random text 7
    2012-02-06 12:24:54 [Local] []  random text 8
    2012-02-06 12:25:10 [Team] [me] ! random text 9
    2012-02-06 12:25:14 [Local] [] Your position is: (160, 909)
    2012-02-06 12:26:07 [Local] [] random text 10
    2012-02-06 12:26:25 [Team] [me] ! random text 11
    2012-02-06 12:26:27 [Local] [] Your position is: (10, 9077)
    2012-02-06 12:26:57 [Local] [player6] random text 12
    2012-02-06 12:27:15 [Team] [me] ! random text 13
    2012-02-06 12:28:11 [Local] [] random text 14
    2012-02-06 12:28:17 [Local] [] Your position is: (197, 90627)
    2012-02-06 12:29:14 [Team] [me] ! random text 15
    2012-02-06 12:29:15 [Local] [] Your current position is: (92, 114)
    2012-02-06 12:29:32 [Team] [me] ! random text 16
    2012-02-06 15:10:31 [Team] [] random text 17

    I'd like to capture all my lines that I start with "!" and that are followed by at least one position message before my next line that starts with "!", also capture the coordinates in the position message.

    So in the text above I'd like to capture:

    random text 3, 121, 911
    random text 9, 160, 909
    random text 11, 10, 9077
    random text 13, 197, 90627
    random text 15, 92, 114

    I was going to capture all texts between my "!" lines first, and then do another search for the presence of position messages in them, so I came up with this:

    /\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\[(?:team|local)\] \[me\] !(.+)/igs

    but it doesn't work, it captures the whole text and doesn't see repetitions of the pattern inside it, and I don't know how to do it, so I need help ;-(

    I'm using a language based on ECMA-262

  •  02-06-2012, 5:29 PM 84612 in reply to 84611

    Re: need help finding regexp

    There are several things you can do to improve your pattern.

    The first is that the pattern you have shown needs a space after the last digit of the timestamp and before the "team" or "local" keyword within the square brackets.

    The problem of matching too much text comes from using the '(.+)' construct with the "s" (singleline) option. With the singleline option set, '.' matches newline characters as well as the characters it normally matches. As a result, nothing stops this part of the pattern from matching except the end of the text. If you take off the singleline option, '(.+)' will correctly match only to the end of the "current" line. You then need to add something to your pattern to make it move to the next line and match the numbers in parentheses. Something like:

    \n\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+\[(?:team|local)\]\s+\[me\]\s+!(.+)
    \n\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\s+\[(?:team|local)\]\s+\[\]\s+your\s+(current\s+)?position\sis:\s+
    \((\d+),\s+(\d+)\)

    (I know this does not account for the "random text 13" case but this is just to show the idea).
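
    To illustrate the singleline effect described above, here is a minimal JavaScript sketch. It assumes an engine that supports the "s" (dotAll) flag, which JavaScript only gained in ES2018; the variable names are made up for the example.

```javascript
// Three lines taken from the sample log in the question.
const log = [
  "2012-02-06 12:21:52 [Team] [me] ! random text 3",
  "2012-02-06 12:21:56 [Local] [] Your position is: (121, 911)",
  "2012-02-06 12:25:10 [Team] [me] ! random text 9",
].join("\n");

// With the "s" flag, '.' also matches "\n", so (.+) runs to the end of the text.
const withSingleline = /\[me\] !(.+)/is.exec(log)[1];

// Without it, '.' stops at the line break, so (.+) ends with the current line.
const withoutSingleline = /\[me\] !(.+)/i.exec(log)[1];

console.log(withSingleline.split("\n").length); // 3 (spans all three lines)
console.log(JSON.stringify(withoutSingleline)); // " random text 3"
```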

    If I were doing this, my pattern would be something like:

    !(.+)
    (\n((?!!).)*$)*?
    \n((?!\bposition\b).)*position
    \s+is:\s+\((\d+),\s+(\d+)\)

    with the "ignore case" and "multiline" options set. (If you split it over several lines like this, then you will also need the "ignore whitespace" or equivalent option set as well). The text you want will be captured in match group #1, and the position values in match groups #5 and #6. From those, you can build the output string you are looking for.

    The first line finds the user's text. The second line will skip over any lines that do not contain a "!" (such as your "random text 14" line) but possibly skip no lines at all. The 3rd line will only match a line with the keyword "position" in it and the 4th line will then capture the position coordinates.
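
    As a rough check of the four-line pattern above, the sketch below runs it in JavaScript against a shortened copy of the sample log. JavaScript has no "ignore whitespace" option, so the four pieces are joined into one string; the variable names and the trimming of the captured text are assumptions made for the example, and the log is assumed to use plain "\n" line endings.

```javascript
// Shortened copy of the sample log from the question.
const log = [
  "2012-02-06 12:20:53 [Team] [me] ! random text 1",
  "2012-02-06 12:21:37 [Local] [player3] random text 2",
  "2012-02-06 12:21:52 [Team] [me] ! random text 3",
  "2012-02-06 12:21:56 [Local] [] Your position is: (121, 911)",
  "2012-02-06 12:27:15 [Team] [me] ! random text 13",
  "2012-02-06 12:28:11 [Local] [] random text 14",
  "2012-02-06 12:28:17 [Local] [] Your position is: (197, 90627)",
  "2012-02-06 12:29:14 [Team] [me] ! random text 15",
  "2012-02-06 12:29:15 [Local] [] Your current position is: (92, 114)",
  "2012-02-06 12:29:32 [Team] [me] ! random text 16",
  "2012-02-06 15:10:31 [Team] [] random text 17",
].join("\n");

// The four pattern lines, concatenated; flags: global, ignore case, multiline.
const pattern = new RegExp(
  "!(.+)" +                               // group 1: the user's text
  "(\\n((?!!).)*$)*?" +                   // lazily skip lines without "!"
  "\\n((?!\\bposition\\b).)*position" +   // the line containing "position"
  "\\s+is:\\s+\\((\\d+),\\s+(\\d+)\\)",   // groups 5 and 6: the coordinates
  "gim"
);

const results = [];
let m;
while ((m = pattern.exec(log)) !== null) {
  results.push([m[1].trim(), Number(m[5]), Number(m[6])]);
}

console.log(results);
// [["random text 3", 121, 911],
//  ["random text 13", 197, 90627],
//  ["random text 15", 92, 114]]
```

    "random text 1" and "random text 16" are not reported because no position line follows them before the next "!" line or the end of the text.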

    I must admit that I don't know if the ECMA-262 standard supports some of the constructs I've used such as the lazy quantifiers. Also this is just one way you can approach this situation. I must admit that I probably don't understand your problem space fully and I may not have correctly accounted for everything you may encounter. However it might give you some ideas of how to move forward.

    The key thing to remember in constructing a pattern is that you need to account for every possible character that is between the first and the last characters you want to include in the match. Looking at your pattern, you spend quite a bit of effort making sure that the line begins with the date/time stamp etc, whereas all you really want to look for is the "!" character. (I might have misunderstood your complete problem here in which case you may need to expand this a bit to ensure - for example - that the "!" that is used for the start of the pattern must be preceded by the "] " sequence etc.).

    Once you have started a match, then you must account for every character until you get to the last possible character. In your pattern you have not really told the pattern where to stop - there is nothing to locate the "(xxx,yyy)" position text etc. Also if you do take away the "singleline" option, then there is nothing that will match the additional text lines before the "position" line.
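
    The point about requiring the "] " sequence before the "!" can be sketched as follows. The first log line is hypothetical (it is not in the sample) and only stands in for another player's message that happens to contain a "!".

```javascript
const lines = [
  "2012-02-06 12:21:37 [Local] [player3] watch out!", // hypothetical line
  "2012-02-06 12:21:52 [Team] [me] ! random text 3",  // from the sample log
];

// A bare "!" also hits the other player's message.
const bareCount = lines.filter((line) => /!/.test(line)).length;

// Requiring "[me] !" keeps only the lines that "me" started with "!".
const anchoredCount = lines.filter((line) => /\[me\] !/.test(line)).length;

console.log(bareCount, anchoredCount); // 2 1
```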

    Susan

  •  02-07-2012, 11:33 AM 84616 in reply to 84612

    Re: need help finding regexp

    Thanks! The idea of putting a few lines one by one into regexp helped a lot.

    The language I'm using is slightly different though, \s matches not only space but new line and tabs as well, so when I need space I just type it in regexp.

    Still, I spent an hour trying to put two lines "together" in the regexp before I remembered that I'm reading the data from a file, so every line ends with \r\n. Without it nothing works: $\n will not match the end of a line and the beginning of a new one.

    Finally I ended up with this, it does the trick (I added more info into the chat messages between "position is:" and the coordinates):

     /\n(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?:team|local)\] \[me\] !(.+)(?:\r\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(?:team|local)\] (?!\[me\] !).+)*?(?:\r\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(?:team|local)\] \[\] Your current position is: (.+) \((\d+), (\d+).+)/ig

    Thanks again!

  •  02-08-2012, 6:52 PM 84620 in reply to 84616

    Re: need help finding regexp

    Just a couple of explanations for you.

    The reason '$\n' will not match the end of one line and the beginning of another is 3-fold. First, the majority of regex variants treat the '\r' as a white space character and nothing "special" in terms of line breaks. I quite often use '\r?\n' in this situation as some environments terminate a line with just '\n' and others with '\r\n'. Next, the '$' operator is what is called a "zero-width anchor": the "zero-width" means that it will check that the next character is a line terminator BUT it will not advance the character pointer. Therefore the '\n' after it is still necessary to actually match the line terminator, but it requires that the terminator is exactly the "\n" character, which is not the case when the line ends with "\r\n".
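
    A small JavaScript sketch of the '$\n' versus '\r?\n' behaviour, with made-up sample strings. One caveat for this particular engine: JavaScript's '$' (with the multiline flag) does accept a position just before '\r', but the literal '\n' that follows still fails there.

```javascript
const windowsText = "line one\r\nline two"; // lines end with \r\n
const unixText = "line one\nline two";      // lines end with \n

const strict = /one$\n/m;    // '$' consumes nothing; '\n' must be the next character
const tolerant = /one\r?\n/; // optional '\r' before the '\n'

console.log(strict.test(windowsText));   // false: the next character is '\r'
console.log(strict.test(unixText));      // true
console.log(tolerant.test(windowsText)); // true
console.log(tolerant.test(unixText));    // true
```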

    Finally, regexes generally work in terms of a single line break, and not end of line/start of line. Having said that, there are 2 anchors '^' and '$' which match the start and end of the complete text or, if the "multiline" option is set, the start and end of each line. In effect, '^' is a lookbehind operator to see if the preceding character is the line terminator, whereas '$' is a lookahead operator to see if the next character is the line terminator. (They also take the special case of the start and end of the text into account.)
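
    The effect of the "multiline" option on '^' and '$', and the fact that '$' consumes no characters, can be seen in this short JavaScript sketch (sample text and names invented for the example):

```javascript
const text = "one\ntwo";

// Without "m", '^' and '$' only match at the start and end of the whole text.
const wholeTextOnly = text.match(/^\w+$/g);

// With "m", they also match at the start and end of each line.
const perLine = text.match(/^\w+$/gm);

// '$' is zero-width: replacing it inserts a marker but removes nothing.
const marked = text.replace(/$/gm, "|");

console.log(wholeTextOnly);          // null
console.log(perLine);                // ["one", "two"]
console.log(JSON.stringify(marked)); // "one|\ntwo|"
```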

    As for my use of the '\s+' operator, I find it easier to use (and see) in a regex pattern for 2 reasons: the first is that it can be hard to tell what is a space or what is a tab that only expands to a space in the text being processed, and a space character in the pattern will only match an explicit space character in the text; and the second is that you can miss multiple space characters in a pattern whereas the '\s' is easier to see. I know there is the possibility of the '\s' also matching the line terminators etc, but this is usually a small price to pay and can allow some additional flexibility (especially when some text can include line breaks anywhere a space character can occur).
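
    The trade-off between a literal space and '\s+' can be sketched in JavaScript like this (the sample strings are invented for the example):

```javascript
const withTab = "a\tb";       // looks like a space in many editors
const withTwoSpaces = "a  b"; // easy to miss the second space
const withNewline = "a\nb";

const literalSpace = /a b/;
const anyWhitespace = /a\s+b/;

console.log(literalSpace.test(withTab));        // false
console.log(literalSpace.test(withTwoSpaces));  // false
console.log(anyWhitespace.test(withTab));       // true
console.log(anyWhitespace.test(withTwoSpaces)); // true
console.log(anyWhitespace.test(withNewline));   // true: '\s' also crosses line breaks
```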

    Glad you have something that works - that is the main thing.

    Susan
