.*? The reason to any slow regex!
24 February 07 05:10 PM | sirpadk | 48 Comments   

Hi guys,

We all have been there, we have some text we need to allocate using a regex, and we cannot really define what can be between 2 sub matches, so we simply uses the easiest solution .*?. You will probebly wont notice the problem, until the day when the regex cannot find what we are looking for, and then it's takes hours to execute.

The problem will most likely occur when we use .*? twice or more in the same regex. let's try to understand why this is happening.

The reason to this issue is very simple. .*? means any char, zero or more times, but the shortest available.
If my regex is this one: <html>(.*?)<body>(.*?)<span> and the string I'm looking in has an html tag followed by a body tag followed by a span tag, then the regex will execute fast and I will be happy. but if we don't have a body tag or a span tag than the regex will go nuts trying all the possible combinations. why? because we ask it to do so!

In order to simplify the problem, I will create small sample. my text is: "abcde1234" and my regex is [a-z].*?(a).*?\d now I'll write the candidates the regex can check for the first part [a-z].*?
a, ab, abc, abcd, abcde, b, bc, bcd, bcde, c, cd, cde, d, de, e (15 candidates)
Now lets look at the candidates of the second part .*?\d
abcde1, abcde12, abcde123, abcde1234, bcde1, bcde12, bcde123, bcde1234, cde1, cde12, cde123, cde1234, de1, de12, de123, de1234, e1, e12, e123, e1234, 1, 12, 123, 1234, 2, 23, 234, 3, 34, 4 (30 candidates) 15 option combine with 30 will be 450 options. in case of "abcdef1234" we have 21 combine with 31 that's 651 options. add a number "abcdef12345" and you get 21 times 40 equal to 840 and so on and so on......

You can see for your self how by using simple math we can see how each time we add a new char, we get multiple numbers of new combinations for candidates. when you have 5000 chars in a source of an html page, then we are talking about billions of candidates. When you build a regex you should avoid writing regex that has too many candidates.

Here are some methods to reduce the candidates in your regex

  1. run the regex on a small part of the code which is relevant. You can do that by using another regex for allocating the area you wish to work with.
  2. avoid using more than 1 .*? in your regex. remember 2 .*? makes 2 dimension 3 .*? makes 3 etc. the more dimensions the more candidates you have.
  3. use negative definition instead of .*?

I wrote in the beginning of this post that the problem will be relevant the day we cannot find the match, well the reason for that is that when the regex engine will stop execution it finds the match. so if we have 12 billion candidates, but the match is found within the first 10000 candidates, than we will not notice the performance issue. but the day that the regex will make a full scan of all candidates we will see that the regex is slow.

I hope this information will help you writing efficient regex!

May the force be with you.

 

Sponsor
Guys you are misusing the blog as forum! please stop!
02 February 07 08:19 PM | sirpadk | 54 Comments   

Hi there,

In the past months there is a new trend in the regexadvice.com community site. Users are posting a blog post with simple trivial questions which should be posted at the discussions forums.

As far as I understand, writing a blog about a specific technology is a tool where one can share knowledge with other users. If you are new to regex, than what are you writing a blog on? If you have a question go the discussions forum and ask, or even better search the forum, some one had faced your problem before, that is more than likely.

Guys, this community server is our to use, and If we don't use it wisely, we will just end up with having a weird bland website, where new comers cannot understand where to look for what.

Be good.

Sponsor
Protecting your regex from future changes in the source code
02 February 07 07:39 PM | sirpadk | 65 Comments   

Hi guys,

Long time since I've written here, well let's just say it's been busy at work and home front!

Today I had to fix a web crawler due a regex that failed to track a tax amount. The regex was not written by me, but it reminded me what I have learned long time ago: "Remember to protect your regex from future changes in the source code!!!". It gave me the idea to share with you which steps I take in order to fulfill this rule.

When we use an HTML tag in the regex, we will usually write <td>name or <td align="left" >name but who knows which attributes the website we are extracting data from will use, and we know that those attributes can be changed from time to time. So we will maybe try to solve it by writing <td.*?>name but this is dangerous, because this regex is bad performance, and includes the risk of allocating more than you really looking for.

The best way to solve this issue is by using a negative definition. an HTML tag may include basically any char in the world excepts >, so by using this regex <table[^>]*> we can be sure we will never allocate more than a single table element. notice that we do not need to use ?* as this is already taken care of by using the [^>]  definition. Geeked On top of that we do accept any form for text within the tag. attributes, javascript calls styling you name it, we support it!

Now what about white spaces? can we know for sure how many spaces\tabs\newlines will be created by the server side code? No we can not! That is why it's always a good idea to use \s* ALL THE TIME!!! Use it like there is no tomorrow! (Actually use it because you want the regex to work tomorrow as well Stick out tongue )

Lets say the code we need to allocate is the following:

<tabel>
<tr>
<td class="bla" >


name</td>



</tr></table>

Due to the way browsers reads HTML code, the white spaces are removed, so web masters tends to be careless with them, especially when the code is generated automatically by a server side code. a smart regex that is taking this fact is consideration is this one: <table[^>]*>\s*<tr[^>]*>\s*<td[^>]*>\s*([a-zA-Z]+)\s*</td[^>]*>\s*</tr[^>]*>\s*</table[^>]*>  Notice the usage of the \s*  between any element which may be separated by an unknown number of white spaces! Another example is javascript code. In javascript code spaces are not a factor. there is no difference between x=0 and x = 0 and x             =                 0 That is why we need to remember that when we write regex that are extracting data from javascript code. the name space javascript it self can be used both with and without a space (e.g. BLOCKED SCRIPTmyfunc(); BLOCKED SCRIPT myFunc(); )  On the other hand, you cannot break the commands or variable name with a space in javascript. So you need to know a bit about the syntax of the code you are extracting, before you can be sure where you may and may not meet white spaces.

That is all for now.
I hope you will find this information useful.

Be good

Sponsor
Filed under: , ,
How does the regex think?
04 December 06 09:30 PM | sirpadk | 14 Comments   

If you want to write good, efficient regular expressions, than you would like to know, and understand how the regex object is thinking. What happens when you execute a regex? I'll try to put some light on that using conclusions I have made after working with regex for almost 4 years.

 Disclaimer: I didn't see, nor designed a regex object, all my experience is based on working with regex, trying to optimize them, and concluding how come one regex run better than the other.

 So, we all know the DOS days, dir command, and *.txt. that is a very simple regex. we accept all that follows by a dot and a t and an x and a t. but can we understand what the computer was doing?

 Let's say I give you the task of finding all the students which name starts with an C within a list of students. Try to think what you'll be doing. you will look on each line, and acknowledge it as a students name entry, than you will count in the head every time you'll find an entry that starts with the letter C. Now lets say that we want the number of all students that the first name start with C and the last name start with A. Here you will again look on each line, and acknowledge it as a students  name entry, than you will count in the head every time you'll find an entry that starts with the letter C but now there is a new criteria, the second name must start with A, you will acknowledge the second name by the first space after the first name. Well you are human, you are clever, but you are slow, and you get tired.

A program cannot acknowledge a new line for you, it cannot see where a first name ends and the second begins. That is why we have so many syntax in regex, because we need to specify mile stones in the text (the begining and the end of the line in our case), and because we need to define the pattern that define a first name and a last name. in this case a new line is the beginning of a student name entry, and an end line is the end of a student name entry. the space is what separates the first name from the second name. a name of a student can only be from a to z or from A to Z, in regex a-Z, and a name can minimum have 2 chars, so we can say the reg is ^[a-Z]{2,}\s[a-Z]{2,}$ simple, isn't it?

Now that we build this regex, let's say we are the regex object, and let's see step by step what the regex is doing.

The first thing the reg will do is to compile the regex, after that it will take the first char in the source, and ask if it's a ^, if it is it will put it in temp container, and will it will look on the next element in the expression, that is a-Z at least 2 chars. it will ask if the char after the ^ in the source is between a-Z, and if it is it will put in in the temp container, if not it will clear the temp container, and will again ask if the char is a ^. lets say that the second char is between a-Z, than the regex object will look at the 3rd char, and if it's between a-Z it will add it to the temp container, if not it will empty it, and check if it's like ^, now that we have 2 a-Z chars, we can have 2 ways to continue matching, either er find a space, og more a-Z chars. at any point of failure to match this 2, the regex will empty the temp container, and wil try to begin again from one char after the point where it started the match that failed (this is very important to understand in order to debug slow regex). At a point it will come to a space, than it will again look on a a-Z char atleast 2 times, and when it will find 2 a-Z chars, it will look on either more a-Z or an end of line $, when it will find the end of line it will add the value from the temp container to the matches array. and it will start again looking for a match from one char after the match (that is why slow regex are usually regex that do not find a match).

I guess you are just as confused, if not more after reading this part, but no fear, here is a task. take a piece of paper, and write 3 names in it. 1 name on each line, and one space between the first and the last name. now take another piece of paper, and draw a table, with a container column, and a caret column, now scan the list with names using the regex, and the steps I wrote, update the table with a new row for each caret movement.

That is it, it's a simple process, the complication lies in the many different definition we can give the regex to look after. but don't let that scare you! this complication are a tool for you, there are there to help you, to help us. When you want to write a regex, ask your self what do I need to match, a tlf number? That is a number with let's say 10 digits, no other chars are allowed, but all tlf numbers start with tlf: so we can make sure we dont save a social security number as a phone number. So just like you in your head will look for 10 digits pattern that comes after the milestone tlf:, write  a regex that is looking for a milestone tlf:\s and that is followed by 10 digits pattern \d{10} tlf:\s\d{10}

That's all for today,

Be good

Sponsor
How to Extract Data from an Alternated Text Like the Internet?
16 November 06 10:01 PM | sirpadk | 7 Comments   

Many ppl believe that using regex to extract data from a web page is a suicide task. the text changes rapidly, you'll never be able to know what will change, and how, you are risking saving the wrong information is some of the arguments these ppl will tell you.

Well let me get one thing strait, I'm working in a company which is making money from using regex to extract information from hundreds of websites. those websites do change, but because we are thinking very good, and analyzing the site before we jump and write a regex, we can make this impossible mission, very easy.

As I said before, the first rule is Analyzing Analyzing A-n-a-l-y-z-i-n-g! Look on the source code, and look for anchors, and I don't mean <a> tags, but look for text, or tags that you know will less likely to change. a class attribute in an HTML tag will change, a message or a banner will change, but if the site I'm extracting data from is a CD shop site, and there is a lable saying Price:, than this will probably won't change, and if it does, than we can say that the site had a major change, and we will probably need to write more than one regex from scratch.

The second rule is "Better no data than wrong data" write you regex not too flexible, as you wouldn't want to save the ISBN number of a book as it's price.

The 3rd rule is zooming. it performs better, and it minimize risks for mistaking data. Zooming is a process where we start with isolating an area of information with a little regex that is using an anchor "Price:(.*?)</table>" this regex will give us a smaller area of text, which we know for sure has the information we need, and not all the other just that may confuse our regex. now we can write a detailed regex, or even better many small regex, that will allocate the information we need. the performance will be very good as the text is very small, and the risk of extracting the zip code of the company address as a price is 0.

Well that's all for tonight, there are many other rules that i follow, and I'll share them later on.

May the force be with you

sirpadk

Sponsor
The First Step Is The Hardest
16 November 06 09:41 PM | sirpadk | 7 Comments   

Hallo boys and girls

I'm sirpadk, and I have decided to write a blog on regex, because I believe I have a lot of knowledge in this field due to my type of work.

I am a part of the Data Collection Team in Infare Solutions, a danish company specializing in extracting airline data from the internet, and supplying an analyzing tools to airlines in the entire world. regex is the main method we are using to extract the data from the many websites, and after 3 years of daily writing regex, you kinda get the hang of it in many ways. Through my work I have gathered many best practice resolutions that help me and my colleagues to write good and robust regex, that rarely fails, and that perform good and fast. I will try to help you guys understand how you too can write good regex, that are stable, and that wo'nt need to be rewrited each and every day because the site is changing his layout a bit. I will discussed with you how a regex is read by the regex object, and by understanding that you will be able to find the places where the regex is misusing resources, I'll also show you some guide lines regarding writing a low performance regex.

 I hope you'll find this information useful.

May the force be with you

sirpadk

Sponsor