
Michael Ash's Regex Blog

Regex Musings

Remember where you come from

One thing I've noticed among rookie regex users is that they focus far too much on what they are trying to match and not enough on where they are trying to match it from. There is a tendency to drastically underestimate the importance of the source data they are applying their regex against.

I see this all the time on forums, where posters asking for help almost never post any sample data, and most of the few who do post a sample they made up on the spot. Over on the RegexAdvice construction forum we've added guidelines for posting a question. The whole point of the guidelines is to expedite getting questions answered. One of the things we ask is that posters post a sample of the actual data they are trying to match. We explicitly ask that they don't make up a sample. Unfortunately most people don't follow this request, along with others in the guidelines, which generally results in those questions taking longer to get answered.

On this particular point, not supplying any sample is the worst offense, because anyone trying to help you can't see what you are working with and can only give an untested guess at a solution. It's like calling a mechanic on the phone, saying “My car isn't running right”, leaving your number and waiting for him to call back to tell you how to fix it. He can call back and tell you to change your oil, but without more detail he can't know whether that will fix your problem.

However, making up a sample doesn't help much either and is almost as bad. I know people think they are being helpful, sparing everyone the bloody details by providing a simple example to work with, but generally it's not helpful since it doesn't represent their real problem. In regex construction, details matter. Often the made-up sample is too simple. So what they end up getting is a solution that works great for their sample but doesn't handle their real problem well, if at all. So now you are back where you started. This is like calling your mechanic on the phone and saying “My car is making a funny noise.” While it's less vague than “isn't running right”, it's still vague. You may know exactly what kind of noise it is, how loud it is and where it's coming from, but that sentence doesn't communicate any of that. Made-up samples are about the same when asking for regex help.

The majority of regular expression problems are context sensitive, meaning the approach to matching the target string may depend on where it is in the source string. Now sometimes the target string is so distinct from the surrounding text that the context seems irrelevant. Say you are trying to find a string of digits somewhere in your source data, and there just happens to be only one occurrence of digits in your source. Well, in that case a generic regex that finds digits will be good enough.


Source: “abc345def” or “a125bcd”


Regex: \d+


Now I said the context seems irrelevant in this case, but it is as relevant as ever. The context is that digits only occur once in the string.
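As a quick illustration, here is a minimal Python sketch of that pattern run against the two sample strings above. The third string is a hypothetical one I've added only to show what happens once the "digits occur once" context no longer holds.

```python
import re

# The two sample source strings from above; each contains exactly one run of digits.
sources = ["abc345def", "a125bcd"]

digits = re.compile(r"\d+")

results = [digits.findall(s) for s in sources]
print(results)  # [['345'], ['125']]

# Hypothetical counter-example (not from the samples above): once digits occur
# more than once, the same pattern can no longer tell you which run is the target.
print(digits.findall("abc345def678"))  # ['345', '678']
```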


But the relevance becomes clearer if you have a source file where there are two or more occurrences of text that mimic your target data but only one is actually the data you want. For example, let's say you want to find the area code of a phone number. Well, if you agree that an area code is a 3 digit number, the regex \d{3} might be all you need. It depends. On what, you ask? Come on, you know, say it with me... “Your source data!”


If your source data is like the case I mentioned first, where the only digits in the file are area codes, or where any other digits in the file that are not area codes come in runs of fewer than three consecutive digits, then the simple regex should work.


Sample


1) AL 205, 256, 334

2) AK 907

3) AR 479

4) etc.
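A small Python sketch of the simple regex against this sample (the "etc." line is left out). Note that the single-digit list numbers are too short to be picked up by \d{3}, which is exactly the kind of detail you only know from looking at the source.

```python
import re

# The state / area code sample from above, list numbers included.
sample = """1) AL 205, 256, 334
2) AK 907
3) AR 479"""

# Every run of exactly three digits; the one-digit list numbers never qualify.
area_codes = re.findall(r"\d{3}", sample)
print(area_codes)  # ['205', '256', '334', '907', '479']
```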


But if your source is a bunch of phone numbers


1) Mom (555) 123-4567

2) Sis 1-555-234-4567

3) Friend 5553456789


Well, now you have a problem. While your general task is still the same, you have to approach it differently because of the source. You can't simply look for 3 consecutive digits, since that would return 3 matches for each line. But if we know how the data is structured we can focus on how the data we want sticks out from the rest. In the above sample text an area code meets one of three conditions:


  1. 3 consecutive digits inside parentheses

  2. first group of 3 consecutive digits in a string of 1-xxx-xxx-xxxx format

  3. first 3 consecutive digits of a ten consecutive digit string
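To see why the simple pattern is no longer enough for the phone number sample, here is a short Python sketch that runs \d{3} over each line. Only the first match on each line is the area code; the other two are noise.

```python
import re

# The phone number sample from above, list numbers included.
lines = [
    "1) Mom (555) 123-4567",
    "2) Sis 1-555-234-4567",
    "3) Friend 5553456789",
]

# The naive pattern finds three runs of three digits on every line.
matches = [re.findall(r"\d{3}", line) for line in lines]
for line, found in zip(lines, matches):
    print(line, "->", found)
# 1) Mom (555) 123-4567 -> ['555', '123', '456']
# 2) Sis 1-555-234-4567 -> ['555', '234', '456']
# 3) Friend 5553456789 -> ['555', '345', '678']
```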


Now you can still match what you need in one regex, but how you choose to approach it changes. Some regex authors will use look-around to examine the surrounding text; others will match the surrounding text and use groups to pull out the desired data. Either way you have to expand your field of vision beyond what you are trying to match to address the problem.
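Here is one possible sketch of both approaches in Python, written against the three sample lines only. The patterns and the helper name area_code_by_groups are my own illustration, not a general-purpose phone number regex; real data would likely need more alternatives.

```python
import re

sample = [
    "1) Mom (555) 123-4567",
    "2) Sis 1-555-234-4567",
    "3) Friend 5553456789",
]

# Approach A: match the surrounding text and capture the area code in a group.
capture = re.compile(
    r"""
      \((\d{3})\)\s*\d{3}-\d{4}      # condition 1: (xxx) yyy-yyyy
    | \b1-(\d{3})-\d{3}-\d{4}\b      # condition 2: 1-xxx-yyy-yyyy
    | \b(\d{3})\d{7}\b               # condition 3: xxxyyyyyyy
    """,
    re.VERBOSE,
)

def area_code_by_groups(line):
    """Return the captured area code, or None if no alternative matched."""
    m = capture.search(line)
    if m is None:
        return None
    # Only the group belonging to the alternative that matched is not None.
    return next(g for g in m.groups() if g is not None)

# Approach B: match only the area code, using look-around to check its surroundings.
lookaround = re.compile(
    r"""
      (?<=\()\d{3}(?=\)\s*\d{3}-\d{4})     # condition 1
    | (?<=\b1-)\d{3}(?=-\d{3}-\d{4}\b)     # condition 2
    | \b\d{3}(?=\d{7}\b)                   # condition 3
    """,
    re.VERBOSE,
)

for line in sample:
    print(area_code_by_groups(line), lookaround.findall(line))
# 555 ['555']   (printed once per line)
```

Both versions encode the same three conditions; the difference is only whether the surrounding text is consumed by the match or merely inspected.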

But even in this simple example, you can see how knowing what the source data looks like is still important. Imagine someone asked you for help and you were trying to solve this problem without seeing the actual source above. Suppose someone posts (or asks) the question “I need a regex to match a US area code”.

With No Sample: Any helpful regex author has to guess how your data is formatted. Some may assume it's a string of “(xxx) yyy-yyyy”, others may assume it's “xxx-yyy-yyyy”, some may even think it's “State: xxx, xxx, xxx”, and none of them might even consider “1-xxx-yyy-yyyy” or “xxxyyyyyyy”.

 

With Made-Up Sample: “I have a string (555) 555-5555 and I need a regex to match a US area code”

Now a regex author may assume all your phone numbers are in that nice format and that there are no variations. You'll most likely get a regex that matches that format perfectly but fails on the other formats you didn't think were important enough to mention.
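As a rough illustration, assume a helper wrote the pattern below purely from the made-up string “(555) 555-5555”. The pattern is hypothetical, but it is the kind of answer such a question tends to get, and it handles only one of the three real lines from the phone number sample.

```python
import re

# Hypothetical answer written only against the made-up sample "(555) 555-5555".
made_up_only = re.compile(r"\((\d{3})\) \d{3}-\d{4}")

real_lines = [
    "1) Mom (555) 123-4567",
    "2) Sis 1-555-234-4567",
    "3) Friend 5553456789",
]

found = []
for line in real_lines:
    m = made_up_only.search(line)
    found.append(m.group(1) if m else None)

print(found)  # ['555', None, None]
```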


With Real Data: Regex authors don't have to guess. They can look at your data, analyze it themselves, and even test edge cases and address issues you may have overlooked.


I hope this simple example shows how the same problem can have different approaches, all because of the source. Data in the wild is going to be much more complex than this, which will only magnify the issue.


Now sometimes you can't provide real data because it's either too much or private. With a large chunk of data, simply clip a portion containing what you want to match. With private data, making up a sample is almost excusable, but don't make up simple ones. Garble the real data but preserve the format: change names to literary or TV characters, and change addresses to the address of some public office (city hall, post office) or landmark. This is essentially a made-up sample, but if you substitute values your samples won't be as bland and will be closer to your real data and its edge cases.
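For numeric data, one way to garble while preserving the format might look like the Python sketch below. The function garble_digits and the shape helper are hypothetical names of my own, and the choice to scramble only runs of three or more digits is an assumption made so that short structural markers such as a leading “1-” survive; names and addresses would still need substituting by hand.

```python
import random
import re

def garble_digits(text, rng):
    """Replace each digit in runs of three or more digits with a random digit."""
    def scramble(match):
        return "".join(rng.choice("0123456789") for _ in match.group())
    return re.sub(r"\d{3,}", scramble, text)

def shape(text):
    """Reduce a string to its layout by turning every digit into '9'."""
    return re.sub(r"\d", "9", text)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
original = "2) Sis 1-555-234-4567"
garbled = garble_digits(original, rng)

print(garbled)                            # same layout, different digits
print(shape(garbled) == shape(original))  # True
```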


Understanding your source data is crucial to your regex construction whether you are writing it yourself or asking for help.

Published Monday, October 01, 2007 1:42 AM by mash
