A few years ago a wrote about about
a
bug in Internet Explorer's Regex engine that affected patterns
with lookaheads. Well the bug came back in the form of a question on
RegexAdvice.com. It too was a password regex, though not as complex
as the previous pattern that introduced me to this bug.
The first pattern had three conditions
that were being tested for with lookaheads.
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$
With the current pattern only one
lookahead was being used.
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
In both patterns the pattern had a min
and max length. In to original attempts of both the length was being
checked after the lookahead(s) test. While this is perfectly fine in
a non VBScript/JScript world, this is were the bug kicks in those
regex engines. Actually it's probably the same regex engine for both
languages, which is probably why it only effects IE it's the only
browser that uses VBScript and Jscript natively. I don't recall in
my original testing for this bug if I tested it server-side given my
previous blog comments, mostly likely I only tested client-side.
However the recent question it was failing server-side so it's not
actually in the browser but more likely the DLL for those languages.
Anyway the previous blog article covered the behavior that was
happening.
Steve Levithan looked much closer at
the problem in general and discussed
it on his blog. He came to the conclusion that the qualifiers
with a minimum boundary of zero, within the lookahead were the
culprit. He provides a couple of simple examples. I think he's
partially right but I don't think the 1+ qualifiers are excluded from
the problem. I think his provided examples were a little too simple
for them to be effected.
OK let look at the regex pattern in the
recent question
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
The requirements were 6 to 8
alphanumeric (English) string that started with an alpha character,
with at least one digit. Let's use “abc123” as the text string.
Now the above pattern was supplied by
one of the resident pros who frequent the message boards. Let's get
pass the fact the pattern itself is correct and satisfies the
requirements. I'm going to rewrite the pattern to
^(?=\D*\d)[a-z][a-z0-9]{5,7}$
Now this is functionally the same, it's
just within the lookahead the pattern is greedy instead of lazy and
it will make the point I'm trying to make easier to see (I hope)
without having to deal with backtracking. Now this pattern suffers
from the bug in VBScript/JScript. Now Steven suggested the one or
more qualifier (+) doesn't suffer from the problem but change the
to a + , which doesn't effect the match because the first test after
the lookahead is for an alpha so the lookahead placed as it is will
always match at least one character with the \D* pattern. So now we
have
^(?=\D+\d)[a-z][a-z0-9]{5,7}$
which is also bitten by the bug.
I first tried the same approach on my
original attempt dealing with the bug, but no luck. I tried using
the + qualifier, still didn't work. Now the person posting the
question stated without the lookahead the remaining pattern worked
fine, excluding the at least one digit test. So I begin testing the
part of patterns such as
^a-z][a-z0-9]{5,7}$
^(?=\D*\d)[a-z]
^(?=\D+\d)[a-z]
are just a few attempts, all work as
expected. It was only when I put the whole pattern back together
that it begin failing. But after some more trial and error I think
I've come across a pattern with the bug. Going back to my original
examination of the problem I discovered that this pattern ^(?=\D*\d)[a-z][a-z0-9]{2,7}$ matches
the test string “abc123” now this doesn't seem to quite fit with
my original assessment of what was happening because in that pattern
after the lookahead test it's just testing for a certain number of
any characters, but it this pattern it's looking for a specific range
of characters but if you up the min boundary on the last qualifier by one to
^(?=\D*\d)[a-z][a-z0-9]{3,7}$ the pattern fails again. Now if you
change the zero to infinity qualifier in the lookahead to one to infinity qualifier in
the modified pattern so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$ the pattern
matches again. Bump up the min boundary of this new pattern to
^(?=\D+\d)[a-z][a-z0-9]{4,7}$ Bugaboo! The pattern fails again.
Now I don't have access to the source
code so this is just supposition but here's what I believe is
happening. I think the data examined by the lookahead is being
stored in a stack structure. But just looking at the patterns that
are working and failing it looks like once the lookahead is satisfied
values when a qualifier is encountered in the consuming portion of
the pattern, the lookahead's match is reconsidered with the minimum
boundary of the qualifier in the lookahead popped off the stack of
the lookahead's match. Let's look at the first pattern that worked
^(?=\D*\d)[a-z][a-z0-9]{2,7}$
OK we'll start with after the lookahead
is matched. Now the lookahead is supposed to be non-consuming so the
pointer should still be at the beginning of the string.
^(?=\D*\d) matches “abc1” Now the
rest of the pattern matches normally until we get to the qualifier in
the consuming portion of the pattern we are looking for at least two
alphanumeric characters. At that point the lookahead match is
reconsidered, the lower bound being zero, nothing is remove for the
stack but the current character pointer now points to the character
after the lookahead's match value of “abc1”.The qualified part of
the pattern [a-z0-9]{2,7}$ can be satisfied by “23” so we get a
match
Now if we do the same thing with
^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic the regex
fails because the qualifier in the consuming portion is look for at
least 3 characters and there aren't that many if it tries to satisfy
that part of the pattern after “abc1” in the test string.
Now let's look at
^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only
difference from the previous pattern is the lower boundary of the
lookahead qualifier. It's now 1. So if we pop 1 character of the
stack of the lookahead's match of “abc1” we get “abc” leaving
us with “123” to be matched by the consuming qualifier, which is
just enough.
Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$
apply the same logic, now the pattern fails because the consuming
qualifier is look for as least 4 characters after the stack popping
of lookahead's match.
If you changed the lookahead qualifier
to {2,} the pattern would match. You can continue upping the qualifiers by one in the consuming portion to make the pattern not match then non-consuming part of the pattern to get the match so the behavior seems pretty
consistent with my theory which the consuming qualifier is pointing
to the end of the lookahead match and moving back the number of
character of the lookahead minimum boundary. It also seem to explain
the effect of my original encounter with the bug. It may as well as
why the workaround of testing the length first with another avoids
the bug because that consuming qualifier it that case is usually just
, though + seem to work too which doesn't quite fit but consuming
qualifiers with minimum boundaries of zero or one don't seem to be
effected in any case. In all of the above test cased those values
were below the minimal threshold of every successful test. However
the test cases above has only one lookahead that doesn't backtrack,
who know what influence backtracking, additional lookaheads or how
addition qualifiers in the consuming parts pattern would be
effected. Now while the test values support my theory I can't say for
sure that things are happening exactly the way I've laid out. The
actual mechanics may be different but whatever is happening under the
hood clearly pointers are being corrupted and the regex engine is
loosing it's place
The thing that was so confusing about
this was it only kicks in with a qualifier in the consuming portion
is encountered but if there was something between the lookahead and
the qualified portion it match normally. So this is something it's
really hard to make test cases for because you get these ghost value
popping up latter on in the test than I expected. Not to mention the
pattern itself is correct so even with a tool that will let you step
through the matching you don't get this behavior unless that tool is
using VBScript's regex engine and I haven't seen such a tool for that
engine.
If you are using lookaheads with
JavaScript client-side then you are going to be susceptible to the
bug in IE because it will use the Jscript engine. And while you
should always validate server-side if VBSCript or Jscript is your
server-side language you are still at risk. So a platform like
classic ASP which uses both of those languages by default is at risk
client and server side, but a platform like PHP while it still suffer
the bug client-side for IE, should work correctly server-side which
is using a different regex engine. Same goes for non-web clients
using the JScript/VBScript DLL.
The workaround for the strong password
type of regex is when using lookaheads to include the upper boundary
test(s) before the no upper boundary test then use .* to consume
characters. The bound test should keep the pattern from running
forever. However depending on the complexity of your criteria this
may not always be an option but try it first anyway.