Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Performance and more Performance...

Last post 11-18-2006, 1:01 PM by sirpadk. 12 replies.
Sort Posts: Previous Next
  •  03-13-2006, 11:19 AM 15742

    Performance and more Performance...

    Hi all,

    Just came across this site, and immediately joined these forums.

    Surprising that there is not a single post in this category. I must be the only one crazy enuf about this ! Surprise [:O]

    I am familiar with drafting Regular expressions for my needs but wanted to understand how I can make them more efficient. I did search a lot for pointers on writing highly performant Regexes, but could find very less information. You can find thousands of sample Regexes for every need, but is anyone talking about which one searches faster.

    I understand that the difference comes in only when parsing rather large files, but that's the type of files I usually get saddled with. Hence my concern.

    So, I would appreciate any pointers on writing efficient Regexes, increasing their performance and minimizing Backtracking etc. Links to useful articles are also highly appreciated.

    Lastly, does anyone know of any tools out there that measure the comparative Performance of different Regular expressions. At present, I am using RegexBuddy's Debugger to elicit such information.

    Thanks and Regards,

    Cerebrus.

  •  03-25-2006, 5:12 PM 15974 in reply to 15742

    Re: Performance and more Performance...

    I'm very concerned about this stuff as well (I have some complex regexes that run over every page on faa.gov, so they need to be lightning fast...ideally taking less than 1 millisecond), but I'm not a guru on the subject. One of the best ways to minimize backtracking is of course appropriate use of atomic groupings (a feature which goddamn ColdFusion doesn't support). A small thing that I'm religious about is using noncapturing groups [(?:) vs. ()] whenever I don't need a grouping as a backreference, which will run just a tiny bit faster (especially if you're grouping large blocks of text). Other things are e.g. starting a pattern with a literal character or anchor instead of a lookaround whenever possible.

    An example of atomic grouping:

    The following matches a complete HTML file, but performance will be terrible on HTML files that miss some of the tags (and thus won't be matched by this regular expression):
    <html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>

    You can greatly improve the performance on invalid HTML files through the use of atomic grouping as follows:
    <html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>

    I too use RegexBuddy's debugger to elicit info on the number of steps regexes take to match/fail text, and while actual time measurements would be cool (I get this info through ColdFusion server debugging info), I think the info RegexBuddy offers is more helpful anyway for determining comparitive optimization between different approaches.

    Naturally, a regex that takes the same about of steps regardless of the length of the text it's searching is always preferred, and a regex who's steps increase linearly with text length is always preferred over a regex who's steps grow exponentially.

    My regex-centric blog :: JavaScript regex tester
  •  04-12-2006, 6:26 AM 16381 in reply to 15974

    Re: Performance and more Performance...

    Thanks for your post, Steve. Sorry for the late response.

    You have provided some useful pointers, that I can summarize as (and I add some of my

    own):

    1. Use Non-capturing groups. Alternatively, use the "ExplicitCapture" Regex option.

    2. Prevent Backtracking by using "Atomic groups" and/or "Possessive Quantifiers". The latter are however, unsupported in .NET Sad [:(]

    3. When there is a choice, construct Regexes that fail faster, if not succeed faster.

    4. Use the dot carefully, since it can match anything !

    :

    :

    Am still working on the list. Hope the experts like Jan Goyvaerts and Sergei Z. can add something. Big Smile [:D]

    I am always looking out for articles I can find on the web, which will guide me to take my Regex knowledge to the next level.

    Warm Regards,

    Cerebrus.

  •  11-05-2006, 11:05 AM 23880 in reply to 15974

    Re: Performance and more Performance...

    Hallo guys,

    I'm new is this forum, but I'm working with regex daily (i workin a compony that collect information from the web) and I know a thing or two abourt regex, and performence.

    my tips are:

    • always use small regex, that zoom in to the desired text in stages. that means, that if you have a big HTML page and you want to allocate prices from it, isolate the price info area and then use a scond regex to get all the prices.
    • count how many times you used .*? in you code, if it's more than once your regex can be slow to execute. try to under stand that a.*?b.*?c.*?d reg ex in a text like this abcabcabcabc.... (time this sequence in 10000) will give the regex object to much optional matches that will only fail with the lack of d.
    • instead of .*?, use a negetive definition of what you don't want to match. eg HTML tag can include many attributes, so instead of saying <td.*?> say <td[^>]*>

    that is my 10 cents


    "When the only tool you have is an hammer, everything tends to look like nails" A. Maslow
  •  11-05-2006, 12:37 PM 23881 in reply to 23880

    Re: Performance and more Performance...

    hi, sirpadk,

    welcome to the forum!

    pls explain one more tine why is it that

    <td.*?>

    would peform more efficient than 

    <td[^>]*>

    Thanks

    Sergei

  •  11-05-2006, 12:38 PM 23882 in reply to 23881

    Re: Performance and more Performance...

    sorry: wanted to say:

    pls explain one more time why is it that

    <td[^>]*>

    would peform more efficiently than 

    <td.*?>

  •  11-06-2006, 7:36 AM 23894 in reply to 23882

    Re: Performance and more Performance...

    Hallo Sergei Z

    It's very hard to explain without a white board, but i'll try my best.

     when we say .*? than we accept anything. if our source is the following

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" ></td>

    and the regex is <td.*?>Sergei Z</td> then the regex will scan the following

    <td class="cool" > No match!

    <td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

     

    while using <td[^>]*>  will result in

    <td class="cool" > No match!

    <td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

    <td class="cool" ></td><td class="cool" ></td><td class="cool" ></td><td class="cool" > No match!

     

    now lets say you use .*? 4 times in the regex, that's mean 4 power 4 power optional matches, that will fail at the end, that can be long precess.


    "When the only tool you have is an hammer, everything tends to look like nails" A. Maslow
  •  11-06-2006, 2:37 PM 23909 in reply to 23894

    Re: Performance and more Performance...

    what do u mean by:

    <td class="cool" > No match!

    if we use regex

    <td.*?>

    sure there will be a match on

     <td class="cool" >

     We r still not on the same page I guess. Pls reply if u have time to spare.

  •  11-06-2006, 2:50 PM 23912 in reply to 23909

    Re: Performance and more Performance...

    read what I wrote

    I said "...and the regex is <td.*?>Sergei Z</td> then the regex will scan the following..."

    ofcourse that <td.*?>  is not an issue, but you will rarely, if ever look for <td.*?> only.


    "When the only tool you have is an hammer, everything tends to look like nails" A. Maslow
  •  11-06-2006, 3:16 PM 23915 in reply to 23912

    Re: Performance and more Performance...

    patience man, ;=), i got that part.

    now u need to explain to me why the regex

    <td.*?>

    will make TWO passes over the text :

    <Quoting U>

    and the regex is <td.*?>Sergei Z</td> then the regex will scan the following

    <td class="cool" > No match!

    <td class="cool" ></td><td class="cool" > No match!

    </Quoting U>

    if that's what u mean by bolding here. I'm not quite convinced that's the case: so why 2 passes? Doesn't the search index go through the text

    <td class="cool" ></td>

    w/o matching and then happilly marches on? why do u think the engine reverts to the beginning of the input?

  •  11-06-2006, 3:38 PM 23917 in reply to 23915

    Re: Performance and more Performance...

    Damn where is my white board! [:'(]

    The thing is that we say .*?, that means also >

    if the text is <td class="cool" ></td><td class="cool" >  and the regex is <td.*?> then we have 2 matchs. if the regex is <td.*?>a than we have a semi match which is: <td; after this semi match the regex object will try to find any combination of any char, 0 times or more, which ends with >, we have 3 times > after the first <td, and 2 times after the second <td, so five times the object will look for the char a, after those 5 optional matches. if wee had used <td[^>]> than we have only 2 candidates, and not five. that means that the object will look only 2 times for the a char.

    that's more than 50% save in a little text, take it and multiple it by 10000 tags.


    "When the only tool you have is an hammer, everything tends to look like nails" A. Maslow
  •  11-18-2006, 9:19 AM 24331 in reply to 23917

    Re: Performance and more Performance...

    sirpadk,

    sorry for not replying. I got lost in your explanations. My fault. Thanks.

    maybe we can return to the subject some other time.

    See ya.

  •  11-18-2006, 1:01 PM 24344 in reply to 24331

    Re: Performance and more Performance...

    Yes, this one must have a white board.

    maybe I'll make an animation that explain it better.


    "When the only tool you have is an hammer, everything tends to look like nails" A. Maslow
View as RSS news feed in XML