
Justin's Regex Blog

Thinking in Regex

The not so basics of NOT!

It is easy to build a regular expression that matches a particular pattern, but much harder to build one that matches the absence of a pattern. Thankfully we've been given some tools, in the form of assertions, that make a NOT operation quite easy. Let's start by looking at the negative look-ahead assertion, using a pattern that has been in hot discussion lately over on Darren's blog, The sugary synax we love... Lookaround.

The pattern matches & so that it can be escaped as &amp;, but it leaves the & alone if it is already part of an entity escape.

&(?!lt;|rt;|amp;)

We used the negative look-ahead to match all ampersands that aren't followed by one of the listed entity names. Not too hard at all. The NOT in this case means look for specific items and fail the match if they exist. (The standard entity for > is &gt;, so in practice you would probably want gt; where the pattern says rt;.) Along with the negative look-ahead assertion there is an equivalent negative look-behind assertion, written (?<!...). With it you can match patterns that aren't preceded by something.
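As a rough illustration, here is how the look-ahead pattern might be used from Python's standard re module to escape bare ampersands. The function name escape_ampersands and the sample strings are made up for this sketch, and the small look-behind pattern at the end is just an illustrative counterpart, not something from the original discussion.

```python
import re

# Pattern from the post: an ampersand NOT followed by one of the listed names.
# (rt; is kept as written in the post; use gt; for real HTML entities.)
UNESCAPED_AMP = re.compile(r"&(?!lt;|rt;|amp;)")

def escape_ampersands(text: str) -> str:
    """Replace bare ampersands with &amp;, leaving the listed entities alone."""
    return UNESCAPED_AMP.sub("&amp;", text)

# Look-behind counterpart (illustrative): a "b" that is NOT preceded by "a".
B_NOT_AFTER_A = re.compile(r"(?<!a)b")

print(escape_ampersands("Tom & Jerry &amp; co &lt;b"))  # Tom &amp; Jerry &amp; co &lt;b
print(B_NOT_AFTER_A.findall("ab cb b"))                 # ['b', 'b']
```

Because the look-ahead is zero-width, only the & itself is replaced; whatever follows it is left untouched.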

Now, let's talk about the other forms of NOT that regular expressions give us that aren't related to assertions. First off, we can build a negated character class, which matches any character except those in a given set. If we wanted to begin building a simple version of the assertion pattern, we might at least make sure that the character following the & isn't an l, r, or a.

&[^lra]

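Now we are matching every & that is followed by a character other than l, r, or a. Unlike the look-ahead, the character class consumes that following character, and it requires one to be present. If we continue building the pattern out we can even use character classes to form complex NOT operations for entire strings and groups of strings.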
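A quick way to see what the simple character class version does, again using Python's re module. The sample inputs are invented for this sketch.

```python
import re

# An ampersand followed by any single character other than l, r, or a.
NOT_LRA = re.compile(r"&[^lra]")

# The class consumes a character, so each match is two characters long.
print(NOT_LRA.findall("x & y"))        # ['& ']
# Entity-like text is skipped, as intended.
print(NOT_LRA.findall("&amp;"))        # []
# ...but so is anything else that merely starts with l, r, or a.
print(NOT_LRA.findall("&lambda"))      # []
# A trailing & has no following character to consume, so it is missed too.
print(NOT_LRA.findall("ends with &"))  # []
```

The last two cases are the gaps that the longer alternation tries to close.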

&($|[^lra]|a[^m]|am[^p]|amp[^;]|[lr][^t]|[lr]t[^;])

We've now constructed a pattern that uses the character class NOT operator to perform much the same operation as the negative look-ahead assertion. It is not quite identical: the match includes the characters after the &, and an input that ends partway through an entity name, such as &a, is not matched. If you look at the comments in the post I've referenced you'll also notice a third syntax I use for creating a NOT operator: a conditional that tests for something and, only if the test succeeds, tries to match something impossible.
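To compare the two approaches side by side, a small sketch like the following can help. The helper name compare and the test strings are assumptions made for illustration; both patterns are taken as written above.

```python
import re

LOOKAHEAD = re.compile(r"&(?!lt;|rt;|amp;)")
CHAR_CLASS = re.compile(r"&($|[^lra]|a[^m]|am[^p]|amp[^;]|[lr][^t]|[lr]t[^;])")

def compare(text):
    """Return the matched text for each pattern as a (look-ahead, class) pair."""
    la = [m.group(0) for m in LOOKAHEAD.finditer(text)]
    cc = [m.group(0) for m in CHAR_CLASS.finditer(text)]
    return la, cc

print(compare("&amp;"))   # ([], [])            both skip a real entity
print(compare("a & b"))   # (['&'], ['& '])     class version consumes the space
print(compare("&amp x"))  # (['&'], ['&amp '])  class version consumes "amp "
print(compare("&"))       # (['&'], ['&'])      $ handles a lone trailing &
print(compare("&a"))      # (['&'], [])         class version misses this one
```

If the pattern is used for replacement, the consumed characters matter: they would have to be captured and put back, whereas the look-ahead version replaces only the & itself.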

&(?(lt;|rt;|amp;)^)

The above pattern tries to match the start-of-string anchor ^ if it finds any of the literals we don't want to follow the & character. Since the start of the string can't match immediately after an &, the expression fails at that point, and the conditional acts like a NOT operator. The conditional construct is not available in every regex flavor; .NET supports it in this form.

Hopefully you've found some of this interesting, then again maybe not. Each tool in a regular expression can be viewed as a hint at how the pattern is going to operate. Some of the tools affect the string scanner (such as assertions that do a forward or backward scan based on the current stream offset); others control backtracking and greediness. By understanding how these tools work and manipulating them, you can build the most performant expressions.

Published Friday, August 06, 2004 1:50 AM by jrogers