Wayne's Regex Rants

A Split Approach

Ahh, my first blog entry...

The Regex.Split() method offers a clean way to break a delimited string of text into substrings. Its advantage over the standard String.Split() method is that the delimiter can be variable or not precisely known. A simple example suited to a regex is a comma-delimited list where each comma may or may not be followed by a space:

string[] fields = Regex.Split(delimitedList, ", ?");
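As a quick check of that one-liner, here is a minimal, self-contained sketch. The class name SplitDemo, the helper SplitList, and the sample input are all made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitDemo
{
    // Splits on a comma that is optionally followed by a single space.
    public static string[] SplitList(string delimitedList)
    {
        return Regex.Split(delimitedList, ", ?");
    }

    static void Main()
    {
        // Hypothetical input mixing "," and ", " as delimiters.
        foreach (string field in SplitList("red,green, blue"))
            Console.WriteLine(field);
    }
}
```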

There are times, however, when the Split method is not the best choice for splitting a string. Consider a comma-delimited string where each field may be surrounded by quotes, and a quoted field may contain embedded commas. Example:

Mister King,"123rd St, Redmond",425-555-1234
The split should produce:
[0] => Mister King
[1] => "123rd St, Redmond"
[2] => 425-555-1234

To use the Regex.Split() method, you would need a pattern that matches a comma except when it falls between an opening quote and a closing quote. While it may be possible to devise such a pattern, it would be pretty tricky and certainly more complex than the alternative: use the Regex.Matches() method with a pattern that explicitly matches each field instead of each delimiter.

Regex rex = new Regex(
      //match quoted text if possible
      //otherwise match until a comma is found
      @"  ""[^""]*""  |  [^,]+  ",
      RegexOptions.IgnorePatternWhitespace);
int i=0;
foreach(Match m in rex.Matches(delimitedList))
   Console.WriteLine("[{0}] => {1}", i++, m.Value);

If more than one delimiter is possible (e.g., commas or semicolons), or if more than one kind of quote is possible (e.g., double quotes or single quotes), it is easy enough to enhance the pattern:

      @" ""[^""]*""  |  '[^']*'  |  [^,;]+ "
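For reference, here is that enhanced pattern dropped into a small, self-contained program. This is only a sketch: the class name FieldMatcher, the helper Fields, and the sample input (which mixes a semicolon, a comma, and single quotes) are illustrative and not taken from real data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FieldMatcher
{
    // Double-quoted text, or single-quoted text, or a run of characters
    // that contains neither a comma nor a semicolon.
    static readonly Regex Rex = new Regex(
        @" ""[^""]*""  |  '[^']*'  |  [^,;]+ ",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        foreach (Match m in Rex.Matches(delimitedList))
            fields.Add(m.Value);
        return fields;
    }

    static void Main()
    {
        // Hypothetical input: semicolon and comma delimiters, single quotes.
        string input = "Mister King;'123rd St, Redmond',425-555-1234";
        int i = 0;
        foreach (string field in Fields(input))
            Console.WriteLine("[{0}] => {1}", i++, field);
    }
}
```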

It gets a little trickier if you need to allow for empty fields, as in this input:

Mister King,,425-555-1234

One way to handle that:

Regex rex = new Regex(
      @" (?: ,|^ )  ( ""[^""]*""  |  [^,]* )  ",
      RegexOptions.IgnorePatternWhitespace);
int i=0;
foreach(Match m in rex.Matches(delimitedList))
   Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);

We now match both the delimiter and the field that immediately follows it. To accomplish that, the first group in the pattern matches either a comma (the delimiter) or the beginning of input (the first field has no delimiter preceding it, so we have to match the beginning of input instead). When we extract the fields from the collection of matches, we just need to ignore the delimiters we've matched. The code above does that by making the delimiter match a non-capturing group, (?: ), and by taking the field text from the capturing group instead of from the entire text of each match.


The above pattern works for all cases except when the first field is empty.  I'll leave dealing with that as a future exercise.  Hint: match the entire input with a single match, and capture all the fields into a capturing group.
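One possible reading of that hint, sketched here as an assumption rather than as the post's own solution: anchor the pattern at both ends so it matches the whole input once, reuse a single named group for every field, and read the fields back from that group's Captures collection. The group name field, the class name WholeLineMatcher, and the helper Fields are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WholeLineMatcher
{
    // First field, then zero or more (comma, field) pairs, up to end of input.
    // Both groups share the name "field", so .NET gathers every field into
    // the same CaptureCollection in the order it was captured.
    static readonly Regex Rex = new Regex(
        @"^ (?<field> ""[^""]*"" | [^,]* )
            (?: , (?<field> ""[^""]*"" | [^,]* ) )* $",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        Match m = Rex.Match(delimitedList);
        if (m.Success)
            foreach (Capture c in m.Groups["field"].Captures)
                fields.Add(c.Value);
        return fields;
    }

    static void Main()
    {
        // Hypothetical input whose first field is empty.
        int i = 0;
        foreach (string field in Fields(",Mister King,425-555-1234"))
            Console.WriteLine("[{0}] => {1}", i++, field);
    }
}
```

Because the first field is matched by its own group rather than by a delimiter-or-start alternation, an empty first field is captured like any other.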

The (?: ,|^ ) pattern shown earlier matches a field and the delimiter that immediately precedes it.  A similar approach is to match a field and the delimiter that immediately follows it.  Note the similarities of this pattern and code:

Regex rex = new Regex(
      @" ( ""[^""]*""  |  [^,]* )  (?: ,|$ ) ",
      RegexOptions.IgnorePatternWhitespace);
int i=0;
foreach(Match m in rex.Matches(delimitedList))
   Console.WriteLine("[{0}] => {1}", i++, m.Groups[1].Value);

Note that this approach doesn't eliminate the empty-field issue; it shifts it from the first field to the end of the input.
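To make that caveat concrete, the sketch below runs the delimiter-after pattern over the earlier sample. On .NET, Regex.Matches also attempts a match at the very end of the input, where this pattern can succeed on an empty field followed by $, so the result can end with one extra empty entry. The class name TrailingDelimiterDemo and the helper Fields are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TrailingDelimiterDemo
{
    // A field followed by either a comma or the end of input.
    static readonly Regex Rex = new Regex(
        @" ( ""[^""]*""  |  [^,]* )  (?: ,|$ ) ",
        RegexOptions.IgnorePatternWhitespace);

    public static List<string> Fields(string delimitedList)
    {
        var fields = new List<string>();
        foreach (Match m in Rex.Matches(delimitedList))
            fields.Add(m.Groups[1].Value);
        return fields;
    }

    static void Main()
    {
        // Three real fields go in; count the entries that come out.
        List<string> fields = Fields("Mister King,,425-555-1234");
        Console.WriteLine("entries: {0}", fields.Count);
        for (int i = 0; i < fields.Count; i++)
            Console.WriteLine("[{0}] => '{1}'", i, fields[i]);
    }
}
```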

-Wayne

Published Monday, January 12, 2004 8:08 PM by wayneking

Comments

 

wayneking said:

For all the coding I do, it's safe to say that I'm a moron when it comes to RegEx stuff. Normally I post a 'I need something that does this' type email to the RegEx list and Dave Wanta writes it for me (Thanks Dave). This is something that I needed to do today though. You've just saved me loads of time.

Danka


April 16, 2004 12:14 PM
 

wayneking said:

Very elegant. Nice job.
February 18, 2005 8:24 PM
 

wayneking said:

I originally tried using a RegEx-based solution for an event-driven object model used for parsing fixed-width and delimited text files. It worked great for "narrow" records (i.e. <10 or so fields), but it didn't scale well at all for "wide" records. I had one example where an 80-field-wide record spent over 3 minutes trying to build the Match collection before I terminated it (this was using the .NET Framework 1.1 RegEx classes).

I looked for a while for some help in optimizing the RegEx pattern, but nothing anyone suggested made any significant improvement, so I ended up abandoning the RegEx route and reverting to a more traditional Split() for delimited and Substring() for fixed-width.

Have you seen similar performance problems with wide/large record matching?

Tony
February 19, 2005 5:59 AM