
Michael Ash's Regex Blog

Regex Musings

A touch of Character Class

The square-bracket character class is one of the more misunderstood of the basic regex features. It is supported in virtually all regex implementations. In fact, off the top of my head, I don't know of an implementation that doesn't support it. Maybe it's not well documented in most tutorials, or maybe the samples are not clear enough, or maybe users are just skimming over the details for this one, but I see this feature misused quite often.

This type of character class simply matches one and only one of the characters between the square brackets. That's pretty much it. There are a few caveats about what can go between the brackets, but the outcome is the same: one, and only one, character will be returned if there is a match.

A common error rookie regex users tend to make is trying to write a pattern inside the character class to make it match a particular sequence of characters, whether that is a fixed string or a more generic regex pattern. One of the obvious tells of this approach is when the same character appears in the character class multiple times, something like [RADAR] or [HELLO]. Listen up newbies, you can't write any sort of pattern within a character class, so don't waste your time trying. If you have a fixed string like “ABC”, don't try using the character class [ABC] to match that string as a whole. As I stated in the previous paragraph, this feature is for matching a character (singular) from a group of characters, not an ordering of characters (plural). So you don't get one match of “ABC”; you get 3 matches, since each character in the string is one of the 3 characters in the class. There is nothing you can do inside the brackets to make it return “ABC” as a single match. You can't group these characters together, since you can't write any patterns within the character class. All the characters in the class are tested against just one position in your input.

Now you can add a quantifier, like + or *, after the character class to make “ABC” one match, but remember what the character class matches: one character. The quantifier allows the character class, the whole class, to be applied multiple times. None of the possible choices are eliminated in later applications of the class. Every application of [ABC] tests for 1 of 3 possible characters; [ABC]+ tests for 1 of 3 possible characters, one or more times. So while it will match “ABC”, it doesn't match only “ABC”, which brings me to my second topic: order.
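To make the difference concrete, here is a minimal sketch using Python's standard re module. Python is only my choice for illustration, and the variable names are made up for the example; other regex flavors should behave the same way for these simple patterns.

```python
import re

text = "ABC"

# A bare class consumes exactly one character per match,
# so "ABC" produces three separate matches.
single_matches = re.findall(r"[ABC]", text)

# A quantifier re-applies the whole class, so the run comes back as one match...
run_matches = re.findall(r"[ABC]+", text)

# ...but every repetition may still pick any of the three characters.
also_accepted = re.fullmatch(r"[ABC]+", "CCBA") is not None

# The plain literal is what matches the sequence A, B, C and nothing else.
literal_rejects = re.fullmatch(r"ABC", "CCBA") is None

# Repeating a letter inside the class adds nothing: [RADAR] behaves like [RAD].
same_set = re.findall(r"[RADAR]", "RADAR") == re.findall(r"[RAD]", "RADAR")

print(single_matches)                            # ['A', 'B', 'C']
print(run_matches)                               # ['ABC']
print(also_accepted, literal_rejects, same_set)  # True True True
```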

Along with the incorrect belief that you can include a pattern within a character class comes the belief that the characters within the class have to be in the same order as the characters being tested against. This isn't true. Everything I said about the regex pattern [ABC] is true for the patterns [CBA], [CAB] and [BCA]. They all match exactly the same thing. With a few exceptions, a couple of which have to do with ranges (one of which I've discussed before), the order of the characters doesn't matter. If you are using a range, the value on the left of the hyphen cannot have a higher ASCII value (or code point, for Unicode) than the value on the right. Of the previous equivalent character class patterns, the only way to write an equivalent range is [A-C]. [C-A] won't work, and in some implementations it will throw a compile error.
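A quick check of the ordering claim, again sketched in Python's re module with a made-up sample string. Python happens to be one of the implementations that rejects a reversed range when the pattern is compiled; other flavors may react differently.

```python
import re

sample = "CABBAGE"

# Every ordering of the same three letters, plus the equivalent range.
patterns = [r"[ABC]", r"[CBA]", r"[CAB]", r"[BCA]", r"[A-C]"]
results = [re.findall(p, sample) for p in patterns]

# All five patterns return the identical list of single-character matches.
all_equivalent = all(r == results[0] for r in results)

# A reversed range is an error in Python's re module.
try:
    re.compile(r"[C-A]")
    reversed_range_rejected = False
except re.error:
    reversed_range_rejected = True

print(results[0])               # ['C', 'A', 'B', 'B', 'A']
print(all_equivalent)           # True
print(reversed_range_rejected)  # True
```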

These first couple of errors may happen because new users think that alternation and character classes are interchangeable. They are not. Some tutorials may present that as the case, especially if the examples for each are too simple. Any character class pattern can be written using alternation; however, the reverse isn't always true. Alternation can include single characters, literal strings, or general regex patterns. The character class pattern [ABC] can be written using alternation as A|B|C. The alternation pattern of single characters C|D|E can be written as [CDE] or [DCE], but the alternation pattern of multiple-character patterns (ABC)|(DEF)|(GHI) cannot be written as an equivalent pattern using just a character class. Though you could write a pattern that would include matches of the same 3 values, you couldn't limit it to matching just those 3 values. Even a negated character class can be written using alternation, but that would generally lead to enormous patterns, making it impractical to write the pattern in that fashion. Those trying to write patterns inside of a character class need to turn their attention to alternation and/or one or more grouping constructs.
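The sketch below, in Python's re module, shows both directions. The word list and the [A-I]{3} pattern are my own illustration of "a pattern that includes the same 3 values but can't be limited to them", not something from a particular tutorial.

```python
import re

# Single characters: the class and the alternation are interchangeable.
class_hits = re.findall(r"[ABC]", "CAB")
alternation_hits = re.findall(r"A|B|C", "CAB")

# Multi-character alternatives: only alternation can pin down whole strings.
words = ["ABC", "DEF", "GHI", "ADG", "IHG"]
alternation = re.compile(r"ABC|DEF|GHI")
class_based = re.compile(r"[A-I]{3}")   # three letters, each one of A through I

exact = [w for w in words if alternation.fullmatch(w)]
too_loose = [w for w in words if class_based.fullmatch(w)]

print(class_hits == alternation_hits)  # True
print(exact)                           # ['ABC', 'DEF', 'GHI']
print(too_loose)                       # ['ABC', 'DEF', 'GHI', 'ADG', 'IHG']
```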

The final error I see a lot is one that gets regex rookies and vets alike: regex metacharacters. With rookies this mostly falls under trying to use a pattern within the character class, typically by using quantifiers. Another is using the dot character, which in this situation matches only itself. Within a character class, in most implementations, the only metacharacter that still retains its special regex meaning is \ , which means it can be used as it normally is. The shorthand character class notations, such as \d, will work the same. The exception is \b, which changes its meaning, and that makes sense if you think about it. You can't have any patterns within a character class, and \b outside of a character class doesn't match a character; it would only be used as part of a pattern.
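Here is how this plays out in Python's re module, as a small sketch with made-up inputs. What \b turns into inside a class varies by implementation; in Python it stands for the backspace character.

```python
import re

# Outside a class the dot matches any character; inside it is a literal dot.
dot_outside = re.findall(r".", "a.b")
dot_inside = re.findall(r"[.]", "a.b")

# A quantifier character inside the class is just one more literal to choose from.
plus_inside = re.findall(r"[a+]", "aa+")

# The backslash keeps working, so shorthand classes like \d behave as usual.
digits = re.findall(r"[\d]", "a1b2")

# In Python, \b inside a class means backspace, not a word boundary.
# Note the input is a normal (non-raw) string, so "\b" is a real backspace.
backspace = re.findall(r"[\b]", "x\by")

print(dot_outside)   # ['a', '.', 'b']
print(dot_inside)    # ['.']
print(plus_inside)   # ['a', 'a', '+']
print(digits)        # ['1', '2']
print(backspace)     # ['\x08']
```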

However, one metacharacter, ^, changes its behavior, but only if it is the first character inside the brackets, where it negates the class. This is the other position-dependent case I was referring to earlier. All the other regular metacharacters that have a special meaning outside a character class lose their special meaning within one and are interpreted as literals, so there is no need to escape these characters, though doing so isn't wrong, just unnecessary. I talked about the hyphen, which gains a feature, a few years ago.
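A short sketch of the caret's position dependence and of the "escaping is harmless but unnecessary" point, using Python's re module and sample strings of my own:

```python
import re

# Leading caret: negation. Matches anything that is NOT a, b or c.
negated = re.findall(r"[^abc]", "abcd^")

# Caret anywhere else: just a literal caret in the set.
literal_caret = re.findall(r"[a^bc]", "abcd^")

# Other metacharacters are literals inside the class, escaped or not.
unescaped = re.findall(r"[$*?]", "a$b*c?")
escaped = re.findall(r"[\$\*\?]", "a$b*c?")

print(negated)               # ['d', '^']
print(literal_caret)         # ['a', 'b', 'c', '^']
print(unescaped == escaped)  # True
```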

Finally, one last thing I see rookie regex users doing isn't really an error, but it makes it look like they aren't sure when to use a character class: a positive character class with only one character in it, like [k]. There is no reason to do this. Just type the character. If you were negating a single character then it makes sense, but otherwise it's just extra typing.


The character class is great when you are trying to match (or negate) a single character at one position in your input and you have lots of possible characters to choose from, but anything else is outside of its domain.






Published Thursday, January 31, 2008 10:49 AM by mash
