Matching multiple values out of a large String

Last post 04-04-2009, 7:26 PM by Aussie Susan. 1 replies.
  •  04-03-2009, 9:10 PM 51964

    Matching multiple values out of a large String

    Hello,
    Sometimes I extract several values from large HTML files/strings. Now I wonder which is better for performance: putting everything into one regular expression, or doing one match per value.

    For example: I have a huge string which has values for "langid=", "date=", and "userid=". These values can be anywhere in the huge string. Which performs better with PCRE:

    preg_match('/langid=(\d+).*?date=([0-9. ]+).*?userid=(\d+)/', $largestring, $results); // METHOD 1

    or one preg_match for every value (METHOD 2):

    preg_match('/langid=(\d+)/', $largestring, $lang);
    preg_match('/date=([0-9. ]+)/', $largestring, $date);
    preg_match('/userid=(\d+)/', $largestring, $user);

     

    With Method 1 everything gets extracted in "one turn", but the internal steps are doubled because of the backtracking (if I see it correctly), and with Method 2 the engine starts from the beginning every time. Is this right? Which performs better in your eyes, and why?

  •  04-04-2009, 7:26 PM 51968 in reply to 51964

    Re: Matching multiple values out of a large String

    The real answer depends on so many other factors that it is really hard to say without you performing some tests and seeing which provides acceptable (whatever that means to you) performance. If this is something that you do hundreds of times a second then you may end up with a very different answer than if you were doing this once a day.
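    As a starting point for such a test, here is a minimal timing sketch. The filler markup, the sample values and the number of runs are made up for illustration, and the numbers will vary with your PHP version, PCRE settings and real data, so treat it as a template rather than a result.

```php
<?php
// Hypothetical test data: three values separated by filler markup.
$filler = str_repeat('<td>x</td>', 2000);
$largestring = $filler . 'langid=7' . $filler . 'date=03.04.2009'
             . $filler . 'userid=42' . $filler;

$runs = 200; // arbitrary; raise it until the timings are stable

// METHOD 1: one pattern for all three values.
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    preg_match('/langid=(\d+).*?date=([0-9. ]+).*?userid=(\d+)/', $largestring, $results);
}
$method1 = microtime(true) - $start;

// METHOD 2: one preg_match per value.
$start = microtime(true);
for ($i = 0; $i < $runs; $i++) {
    preg_match('/langid=(\d+)/', $largestring, $lang);
    preg_match('/date=([0-9. ]+)/', $largestring, $date);
    preg_match('/userid=(\d+)/', $largestring, $user);
}
$method2 = microtime(true) - $start;

printf("Method 1: %.4f s, Method 2: %.4f s (%d runs each)\n", $method1, $method2, $runs);
```

    Run it against a sample of your real input rather than the filler above. Also note that without the "s" modifier the dot does not match newlines, so Method 1 as written only finds the values when all three sit on the same line.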

    The first pattern requires that the order of the "langid", "date" and "userid" keywords is always the same and it will get confused with text such as

    abc langid=123 userid=456 def langid=789 date=876 userid=432

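    in that the match starts at the first "langid" (capturing "123"), and the lazy ".*?" then runs across " userid=456 def langid=789 " to reach "date=", so you end up with 123, 876 and 432: values taken from two different records.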
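    A quick way to see this is to run the first pattern against that sample text and look at the capture groups. This is only a small check, not a fix:

```php
<?php
$text = 'abc langid=123 userid=456 def langid=789 date=876 userid=432';

preg_match('/langid=(\d+).*?date=([0-9. ]+).*?userid=(\d+)/', $text, $m);

// $m[0] is the whole match; it begins at the FIRST "langid=".
// $m[1] is the langid taken from the first record.
// $m[2] and $m[3] are the date and userid taken from the second record.
// The date class [0-9. ] also accepts spaces, so $m[2] may carry a
// trailing space; trim() removes it.
$langid = $m[1];
$date   = trim($m[2]);
$userid = $m[3];

var_dump($m[0], $langid, $date, $userid);
```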

    The second method assumes that all 3 keywords are always given; otherwise you will have problems if you are trying to line up the corresponding values in the resulting arrays. However, it does have the advantage that the keywords can occur in any order.
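    One way to guard against a missing keyword is to check the return value of each preg_match before using its result. The sketch below uses a small helper, extractValue, which is a made-up name for illustration and assumes PHP 7.1 or later for the nullable return type:

```php
<?php
// Hypothetical helper (not from the original posts): returns the first
// captured group, or null when the pattern does not match at all.
function extractValue(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

// Sample text with the keywords out of order and no "date=" at all.
$largestring = 'abc userid=456 def langid=123 ghi';

$lang = extractValue('/langid=(\d+)/', $largestring);
$date = extractValue('/date=([0-9. ]+)/', $largestring);
$user = extractValue('/userid=(\d+)/', $largestring);

// $lang and $user hold the matched digits; $date is null because the
// keyword is absent, so the caller can tell "missing" from "matched".
var_dump($lang, $date, $user);
```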

    In both cases, the initial scanning to find the start of a keyword should be reasonable (approximately linear with the length of the text) in that each starts with literal characters that the regex engine can easily check for a match. Once it finds the potential start of a keyword, then both will require the same amount of time to check that the text really is the required keyword.

    Susan 
