Got more questions? Find advice on: ASP | SQL | XML | Windows
in Search
Welcome to RegexAdvice Sign in | Join | Help

Exceution time of Regex

Last post 02-03-2010, 5:34 PM by Aussie Susan. 1 replies.
Sort Posts: Previous Next
  •  02-03-2010, 1:23 AM 59202

    Exceution time of Regex

    Hi,

    To extract some matter we have created below two regex's :

    Regex 1:

    (\[|\(|\{|\s*)\s*(Spew|space)\s*Above\s*This\s*(line|Um|Use)\s*for\s*(Reacaling|recording)\s*(Date|Data|Dotal|daly)\s*(\]|\)|\s*).{0,150}\s*(MIN|M1N)\s*(\:|\.|\s*)\s*(\d+|\d+\-\d+\-\d{1})\s*.{0,50}\s*Case\s*(Number|NO|s*)\s*(\.|\:|\s*)\s*.{0,20}\s*\d{3}\s*\-\s*\d{7}\s*\-\s*\d{3}\s*.+\w+

    Regex 2:

    (?<=(\[|\(|\{|\s*)\s*(Spew|space)\s*Above\s*This\s*(line|Um|Use)\s*for\s*(Reacaling|recording)\s*(Date|Data|Dotal|daly)\s*(\]|\)|\s*).{0,150}\s*(MIN|M1N)\s*(\:|\.|\s*)\s*(\d+|\d+\-\d+\-\d{1})\s*.{0,50}\s*Case\s*(Number|NO|s*)\s*(\.|\:|\s*)\s*.{0,20}\s*\d{3}\s*\-\s*\d{7}\s*\-\s*\d{3}\s*.+)\w+

    Difference in two regex is only that in the Regex 2, we have used (?<=) from start to end just before last \w+.

    Now when we are testing it in our Regex tester (which has been created in .net windows application, similar to regex tester present on regexlib.com), with Regex1 we are getting results instantenously. But with Regex 2 (which merely contains ?<= additionally) program almost gets hanged). Can you suggest reason for this, is the natural behavior when we use (?<=) or (?=)  

    Our source data from which we have to extract matching string to above regex is  given below.

    OFFICIAL RECORDS OF
    MARICOPA COUNTY RECORDER
    HELEN PURCELL
    120091163962 12/21/2009 10:04
    ELECTRONIC RECORDING
    9027266-11-2-2--
    YORKM
    FIDELITY NATIONAL TITLE
    AFTER RECORDING RETURN TO:
    ACADEMY MORTGAGE CORPORATION
    1218 EAST 7800 SOUTH, SUITE 100
    SANDY, UT 84094
    AT N:CLOSING DEPARTMENT
    (801) 233-3700
    R-'0 CP( 7)66 - 4GT [SPACE ABOVE THIS LINE FOR RECORDING DATA]
    DEED OF TRUST
    HULL
    LOAN #: 1961385
    MIN:1000608-0001961385-8
    PIN: 312-04-381
    CASE #: 023-3817097-703
    THIS DEED OF TRUST ("SECURITY INSTNUMENT") IS MADE ON DECEMBER 4, 2009. THE TRUSTOR IS
    TRAVIS HULL, AN UNMARRIED MAN ("BORROWER"), WHOSE ADDRESS IS 2313 SOUTH FAITH, MESA,
    AZ 85209. THE TRUSTEE IS FIDELITY NATIONAL TITLE ("TRUSTEE"), WHOSE ADDRESS IS 60 EAST RIO
    SALADO PARKWAY, 11TH FLOOR, TEMPE ARIZONA. THE BENEFICIARY IS MORTGAGE ELECTRONIC
    REGISTRATION SYSTEMS, INC. ("MERS") (SOLELY AS NOMINEE FOR LENDER, AS HEREINAFTER DEFINED, AND LENDER'S
    SUCCESSORS AND ASSIGNS). MERS IS ORGANIZED AND EXISTING UNDER THE LAWS OF DELAWARE, AND HAS AN ADDRESS AND
    TELEPHONE NUMBER OF PC) BOX 2026, FLINT, MI 48501-2026, TEL. (888)679-MERS. ACADEMY MORTGAGE
    CORPORATION, A UTAH CORPORATION ("LENDER") IS ORGANIZED AND EXISTING UNDER THE LAWS OF UTAH,
    AND HAS AN ADDRESS OF 1218 EAST 7800 SOUTH, SUITE 100, SANDY, UT 84094. BORROWER
    OWES LENDER THE PRINCIPAL SUM OF THREE HUNDRED ONE THOUSAND FOUR HUNDRED THIRTY
    NINE AND 00/100 DOLLARS (U.S. $301, 439.00). THIS DEBT IS EVIDENCED BY BORROWER'S NOTE DATED THE
    SAME DATE AS THIS SECURITY INSTRUMENT ("NOTE"), WHICH PROVIDES FOR MONTHLY PAYMENTS, WITH THE FULL DEBT, IF
    NOT PAID EARLIER, DUE AND PAYABLE ON JANUARY 1, 2040. THIS SECURITY INSTRUMENT SECURES TO LENDER: 
    THE REPAYMENT OF THE DEBT EVIDENCED BY THE NOTE, WITH INTEREST, AND ALL RENEWALS, EXTENSIONS AND
    MODIFICATIONS OF THE NOTE; (B) THE PAYMENT OF ALL OTHER SUMS, WITH INTEREST, ADVANCED UNDER PARAGRAPH 7 TO
    PROTECT THE SECURITY OF THIS SECURITY INSTRUMENT; AND (C) THE PERFORMANCE OF BORROWER'S COVENANTS AND
    AGREEMENTS UNDER THIS SECURITY INSTRUMENT AND THE NOTE. FOR THIS PURPOSE, BORROWER IRREVOCABLY GRANTS AND
    CONVEYS TO TRUSTEE, IN TRUST, WITH POWER OF SALE, THE FOLLOWING DESCRIBED PROPERTY LOCATED IN MARICOPA
    COUNTY, ARIZONA:
    ATTACH LEGAL
    WHICH HAS THE ADDRESS OF 2313 SOUTH FAITH, MESA, AZ 85209 ("PROPERTY ADDRESS");
    TOGETHER WITH ALL THE IMPROVEMENTS NOW OR HEREAFTER ERECTED ON THE PROPERTY, AND ALL
    EASEMENTS, APPURTENANCES, AND FIXTURES NOW OR HEREAFTER A PART OF THE PROPERTY. ALL REPLACEMENTS AND
    ADDITIONS SHALL ALSO BE COVERED BY THIS SECURITY INSTRUMENT. ALL OF THE FOREGOING IS REFERRED TO IN THIS SECURITY
    INSTRUMENT AS THE "PROPERTY." BORROWER UNDERSTANDS AND AGREES THAT MERS HOLDS ONLY LEGAL TITLE TO THE
    INTERESTS GRANTED BY BORROWER IN THIS SECURITY INSTRUMENT; BUT, IF NECESSARY TO COMPLY WITH LAW OR CUSTOM,
    MFRS (AS NOMINEE FOR LENDER AND LENDER'S SUCCESSORS AND ASSIGNS) HAS THE RIGHT: TO EXERCISE ANY OR ALL OF
    FHA ARIZONA DEED OF TRUST - 10/08
    -' 364.7 PAGE I OF 8

     

    Thanks in Advance.

    Regards

    Jain

    Filed under:
  •  02-03-2010, 5:34 PM 59246 in reply to 59202

    Re: Exceution time of Regex

    I suspect the reason is that the 2 regexs are working in very different ways.

    The first one with work its way forward looking for a complete match based mainly on the  "Space above this line is for recording data" (and other variants). Once it has located this, there are several sub-patterns of the form '.{0,50}' which basically allows the regex to grab the next 50 (or however many) characters no matter what they are and then start actually matching the pattern against the text again. Because the quantifier is greedy, it will always match as many characters as possible and only give them back if it needs to. The last part of the pattern is '\w+' which, in your test string matches the very last character in the string - the "8". Thus there is only a single match from the text.

    The second pattern (in effect -not necessarily in fact) allows the regex engine to ignore the bulk of the pattern and concentrate on the '\w+' right at the end. Only when it has found a match for that will it start looking backwards through the text to see if the lookbehind succeeds. The '\w+' will match on just about every word and number in the text and so the lookbehind processing will be invoked many (MANY) times and will fail quite often. However, when I tried this in the .NET-based Expresso regex tester it found 410 matches and took 0.405... seconds to do so. The matches seem to be most of the words including and after the "THIS DEED OF TRUST ("SECURITY INSTNUMENT")...." (by the way, are there really this number of spelling mistakes in what looks like a legal agreement???). The reason there are so many matches is that the '.{0,50}' style sub-patterns mentioned previously allow the regex great flexibility in making a match, especially as there are several of them in the whole pattern.

    I am always cautious when I see patterns that include multiple optional items in a row, especially when one (or more) of them use the '.' operator. Basically this can be a recipe for performance issues. Take the example of '\s*.{0,50}\s*Case\s*': this looks for the keyword "case" that has just about any number of character before it some of which may or may not be whitespace. Lets just imagine there are 70 space characters before the "Case" sequence - how is the regex to break them up between the first '\s*', the '.{0,50}' and the last '\s*'. In reality the first '\s*' will grab all of the whitespace (because the '*' is greedy) and the following 2 items can match no characters.

    But the problem starts if/when the regex engine starts to backtrack over this sequence: it will release a space form the first '\s*' match which will then be picked up by the '.{0,50}', again leaving the second '\s*' to match nothing. However the end result is the same as the first time, but the regex engine doesn't know this. Whatever caused the backtracking the first time will cause it again (as nothing has really changed) and so now the '.{0,50}' will give up the space only to have the second '\s*' match it. Yet again, nothing has really changed but the regex engine will move on to try to match further down the pattern only to fail yet again. By no I hope you can see that there are MANY different combinations that the regex engine can use to split the 70 spaces among the 3 parts of the pattern, all of which are effectively the same, but ALL must be tried before it can declare a "fail" for this part and backtrack further towards the start of the pattern to, usually to try some other starting point and to have to start the whole sequence outlined above all over again.

    If there are non-whitespace characters in the sequence, the basic problem still remains the same.

    Actually the situation is worse than this because the "Case" keyword can also be matched by the '.{0,50}' component and so (assuming that there is no "Case" sequence more than 50 characters after the starting point of this match) it will have to be found by backtracking through this section of the pattern, thus ensuring that situation described above will always occur.

    It is far better to work out what you really want to do and then do that. In this case, the '.' character set includes the '\s' characters (especially if the 'singleline' option is set which it must be for your pattern to work on the sample test - however you didn't mention this in your posting) and so the pattern

    .{0,52}?Case

    (for example - adding 2 to the allowed match length assuming that the '\s*' at either end can match 1 whitespace). What this does is to look to see if the next characters are "Case": if so then move on; if not then match 1 of the characters with the '.{0,50}?' part and try again. After at most 50 tests it will either match or fail. Even better, if the regex is "smart" enough, it can realise that backtracking is unnecessary and can fail the whole section rather than wasting time with repeated checks.

    If you use '.{0,52}Case' you will still be better off but you will still have the issue of the quantifier being greedy and so it may well match the "Case" sequence the first time thorugh and only "find" it later through backtracking. 

    Also, be aware that this will also find the "case" in "encased" (assuming the "ignore case" option setting is on). This applies to many of the "keywords" that you seem to be looking for in your pattern - placing '\s*' operators before and/or after the keyword will do nothing to help you as they can always match 0 characters and still be OK. It is better to use '\s+' if you know that there must always be whitespace before and/or after the "word" or use something like the '\b' anchor if this is appropriate.

    Susan

View as RSS news feed in XML