The square-bracket character class is
one of the more misunderstood of the basic regex features. This
feature is supported in virtually all regex implementations. In fact,
off the top of my head, I don't know of an implementation that doesn't
support it. Maybe it's not well documented in most tutorials, maybe
the samples are not clear enough, or maybe users are just
skimming over the details for this one, but I see this feature misused
quite often.
This type of character class simply
matches one and only one of the characters between the square
brackets. That's pretty much it. There are a few caveats about
what can go between the brackets, but the result is the same: one, and
only one, character will be returned if there is a match.
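
To make the "one and only one" point concrete, here is a minimal
sketch using Python's re module. Python is just my pick for
illustration, and the variable names are made up for the example.

```python
import re

# A character class consumes exactly one character per application.
single = re.compile(r"[abc]")

one_char = single.fullmatch("a")    # one character from the set
two_chars = single.fullmatch("ab")  # two characters: one class can't cover both
outside = single.fullmatch("d")     # a character that is not in the set

print(one_char is not None)  # True
print(two_chars is None)     # True
print(outside is None)       # True
```

fullmatch is used here so that the whole input has to be consumed by
the single class, which makes the one-character limit easy to see.
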
A common error rookie regex users
tend to make is trying to write patterns inside the character class
to make it match a particular sequence of characters, either a
fixed string of characters or a more generic regex pattern. One of the
obvious tells of this approach is when the same character appears in
the character class multiple times, something like [RADAR] or
[HELLO]. Listen up newbies, you can't write any sort of pattern
within a character class, so don't waste your time trying. If you have
a fixed string like “ABC”, don't try using the character class [ABC]
to match that string as a whole. As I stated in the previous
paragraph, this feature is for matching a character (singular) from a
group of characters, not an ordering of characters (plural). So
you don't get one match of “ABC”; what you get is 3 matches, since
each character in the string is one of the 3 characters. There is
nothing you can do inside of the brackets to make it return “ABC”
as a single match. You can't group these characters together, since you
can't write any patterns within the character class. All the
characters in the class are tested against just one position in your
input. Now you can add a quantifier, like + or *, after the character
class to make “ABC” one match, but remember what the character
class matches: one character. The quantifier allows the whole
character class to be applied multiple times. None of the
possible choices are eliminated in the following applications of the
class. Every application of [ABC] tests for 1 of 3 possible characters;
[ABC]+ tests for 1 of 3 possible characters, one or more times. So while it will
match “ABC”, it doesn't match only “ABC”, which brings me to my
second topic: order.
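
Here is a rough Python sketch of the difference between the literal
pattern, the bare class, and the quantified class. The extra test
strings are ones I made up to show what else [ABC]+ lets through.

```python
import re

text = "ABC"

literal_hits = re.findall(r"ABC", text)   # the fixed string as a whole
class_hits = re.findall(r"[ABC]", text)   # one character per match
plus_hits = re.findall(r"[ABC]+", text)   # the whole class, one or more times

print(literal_hits)  # ['ABC']
print(class_hits)    # ['A', 'B', 'C']
print(plus_hits)     # ['ABC']

# The quantified class is not limited to "ABC": every position
# may be any of the three characters.
candidates = ["CAB", "AAAA", "BCCA", "ABD"]
also_matched = [s for s in candidates if re.fullmatch(r"[ABC]+", s)]
print(also_matched)  # ['CAB', 'AAAA', 'BCCA']
```
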
Along with the incorrect belief that
you can include a pattern within a character class comes the belief
that the characters within the class have to be in the same order as the
characters being tested against. This isn't true. Everything I said
about the regex pattern [ABC] is true for the patterns [CBA], [CAB] and
[BCA]. They all match exactly the same thing. With a few exceptions,
a couple of which have to do with ranges (one of which I've discussed
before), the order of the characters doesn't matter. If you are using a range, the value on the
left of the hyphen cannot have a higher ASCII value (or code point for
Unicode) than the value on the right. Of the previous equivalent
character class patterns, the only way to write an equivalent range is
[A-C]. [C-A] won't work and in some implementations will throw a
compile error.
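
A quick way to convince yourself of this is to run the orderings side
by side. The sketch below assumes Python's re module, which is one of
the implementations that rejects a reversed range at compile time;
other flavors may behave differently. The sample word is arbitrary.

```python
import re

sample = "CABBAGE"
patterns = [r"[ABC]", r"[CBA]", r"[CAB]", r"[BCA]", r"[A-C]"]

# Every ordering, and the equivalent range, finds the same characters.
results = [re.findall(p, sample) for p in patterns]
all_same = all(r == results[0] for r in results)
print(results[0])  # ['C', 'A', 'B', 'B', 'A']
print(all_same)    # True

# A reversed range is refused by Python when the pattern is compiled.
try:
    re.compile(r"[C-A]")
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(reversed_range_error is not None)  # True
```
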
These first couple of errors
may happen because new users think that alternation and character
classes are interchangeable. They are not. Some tutorials may
present that as the case, especially if the examples for each are too
simple. Any character class pattern can be written using
alternation; however, the reverse isn't always true. Alternation can
include single characters, literal strings or general regex patterns. The
character class pattern [ABC] can be written using the alternation A|B|C.
The alternation pattern of single characters C|D|E can be written
[CDE] or [DCE], but the alternation pattern of multiple-character
patterns (ABC)|(DEF)|(GHI) cannot be written as an equivalent
pattern using just a character class. Though you could write a
pattern that would include matches of the same 3 values, you couldn't
limit it to matching just those 3 values. Even a negated character
class can be written using alternation, but that would generally lead
to enormous patterns, making it impractical to write patterns in that
fashion. Those trying to write patterns inside of a character class
need to focus their attention on alternation and/or one or more
grouping constructs.
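
As an illustration, here is a small Python sketch comparing the
alternation with the tightest class-only attempt I could come up with,
[ADG][BEH][CFI]. That class-only pattern and the word list are my own
inventions for the example.

```python
import re

words = ["ABC", "DEF", "GHI", "AEI", "DBF"]

alternation = re.compile(r"ABC|DEF|GHI")
class_only = re.compile(r"[ADG][BEH][CFI]")  # one class per position

by_alternation = [w for w in words if alternation.fullmatch(w)]
by_classes = [w for w in words if class_only.fullmatch(w)]

print(by_alternation)  # ['ABC', 'DEF', 'GHI']
print(by_classes)      # ['ABC', 'DEF', 'GHI', 'AEI', 'DBF']

# Single characters are the case where the two really are equivalent.
same_single = re.findall(r"A|B|C", "CABBAGE") == re.findall(r"[ABC]", "CABBAGE")
print(same_single)  # True
```

The class-only version covers the three wanted values but can't keep
out mixes like “AEI”, because each class is tested on its own position
without any knowledge of the others.
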
The final error I see a lot is one that
gets regex rookies and vets alike: regex metacharacters. With rookies
this mostly falls under trying to use a pattern within the
character class, by trying to use quantifiers. Another is using
the dot character, which in this situation matches only itself.
Within a character class, in most implementations, the only
metacharacter that still retains its special regex meaning is \ ,
which means it can be used as it normally is. The shorthand character
class notations will work the same, except for \b, which changes its
meaning (in many flavors it stands for a backspace character), which
makes sense if you think about it. You can't have any
patterns within a character class, and \b in its use outside of a
character class doesn't match a character and would only be used as
part of a pattern.
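
Here is a short Python sketch of how these behave inside the
brackets. The behavior of [\b] shown here is what Python's re module
does; I'm not claiming every flavor agrees. The input strings are
made up.

```python
import re

# The dot is just a dot inside a class.
dot_hits = re.findall(r"[.]", "a.b")
print(dot_hits)  # ['.']

# A quantifier character inside a class is a literal, not a quantifier.
literal_plus_hits = re.findall(r"[a+]", "aa+")
print(literal_plus_hits)  # ['a', 'a', '+']

# Shorthand classes keep working because the backslash keeps its meaning.
digit_hits = re.findall(r"[\d]", "a1b2")
print(digit_hits)  # ['1', '2']

# In Python, \b inside a class is the backspace character, not a word boundary.
backspace_hits = re.findall(r"[\b]", "a\bc")
print(backspace_hits == ["\x08"])  # True
```
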
However, one metacharacter, ^, changes its
behavior, but only if it is the first character inside the brackets.
This is the other position-dependent case I was referring to earlier.
All the other regular metacharacters that have a special meaning
outside a character class lose their special meaning within one and are
interpreted as literals, so there is no need to escape these characters,
though doing so isn't wrong, just unnecessary. I talked about the
hyphen, which gains a special meaning inside a class, a few years ago.
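
A small Python sketch of the caret's position dependence and of the
harmless extra escaping, again with input strings I made up:

```python
import re

# First position: the caret negates the class.
negated_hits = re.findall(r"[^abc]", "abcd^")
print(negated_hits)  # ['d', '^']

# Any other position: the caret is a literal character.
literal_caret_hits = re.findall(r"[a^]", "abcd^")
print(literal_caret_hits)  # ['a', '^']

# Escaping metacharacters inside a class is allowed but changes nothing.
escaped_hits = re.findall(r"[\.\*\+]", "1.5*2+3")
unescaped_hits = re.findall(r"[.*+]", "1.5*2+3")
print(escaped_hits)                    # ['.', '*', '+']
print(escaped_hits == unescaped_hits)  # True
```
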
Finally, one last thing I see rookie
regex users doing isn't really an error, but it makes it look like
they aren't sure when to use a character class: that's when
they have a positive character class with only one character in it,
like [k]. There is no reason to do this. Just type the character. If
you were negating a single character then that makes sense, but
otherwise it's just extra typing.
The character class is great when you
are trying to match (or negate) a single character at one position in
your input and you have lots of possible characters to choose from,
but anything else is outside of its domain.