Regex Musings
-
A few years ago I wrote about a
bug in Internet Explorer's regex engine that affected patterns
with lookaheads. Well, the bug came back in the form of a question on
RegexAdvice.com. It too was a password regex, though not as complex
as the previous pattern that introduced me to this bug.
The first pattern had three conditions
that were being tested for with lookaheads.
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$
With the current pattern only one
lookahead was being used.
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
Both patterns had a minimum and maximum
length. In the original attempts at both, the length was being
checked after the lookahead test(s). While this is perfectly fine in
a non-VBScript/JScript world, this is where the bug kicks in for those
regex engines. Actually it's probably the same regex engine for both
languages, which is probably why it only affects IE: it's the only
browser that uses VBScript and JScript natively. I don't recall
whether I tested server-side in my original testing for this bug; given my
previous blog comments, most likely I only tested client-side.
However, in the recent question the pattern was failing server-side, so the bug is not
actually in the browser but more likely in the DLL for those languages.
Anyway, the previous blog article covered the behavior that was
happening.
Steven Levithan looked much closer at
the problem in general and discussed
it on his blog. He came to the conclusion that quantifiers
with a minimum boundary of zero within the lookahead were the
culprit. He provides a couple of simple examples. I think he's
partially right, but I don't think the 1+ quantifiers are excluded from
the problem. I think his examples were a little too simple
to be affected.
OK, let's look at the regex pattern in the
recent question
^(?=.*?\d)[a-z][a-z0-9]{5,7}$
The requirements were a 6 to 8 character
alphanumeric (English) string that starts with an alpha character
and contains at least one digit. Let's use “abc123” as the test string.
Now the above pattern was supplied by
one of the resident pros who frequent the message boards. Let's get
past the fact that the pattern itself is correct and satisfies the
requirements. I'm going to rewrite the pattern to
^(?=\D*\d)[a-z][a-z0-9]{5,7}$
Now this is functionally the same; it's
just that within the lookahead the pattern is greedy instead of lazy, and
it will make the point I'm trying to make easier to see (I hope)
without having to deal with backtracking. This pattern still suffers
from the bug in VBScript/JScript. Now Steven suggested the one or
more quantifier (+) doesn't suffer from the problem, so change the
* to a +. That doesn't affect the match, because the first test after
the lookahead is for an alpha, so the lookahead placed as it is will
always match at least one character with the \D* pattern. So now we
have
^(?=\D+\d)[a-z][a-z0-9]{5,7}$
which is also bitten by the bug.
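As a sanity check on the claim that the three lookahead variants are functionally the same, here is a small sketch that runs them side by side in whatever JavaScript engine executes it (assumed here to be one without the bug, such as a current Node.js). The sample inputs and the `agreeOn` helper are made up for illustration.

```javascript
// The three lookahead variants discussed above. In an engine that
// implements lookahead correctly, all three should give the same answer.
const variants = [
  /^(?=.*?\d)[a-z][a-z0-9]{5,7}$/, // lazy, as originally posted
  /^(?=\D*\d)[a-z][a-z0-9]{5,7}$/, // greedy, zero or more non-digits
  /^(?=\D+\d)[a-z][a-z0-9]{5,7}$/, // greedy, one or more non-digits
];

// Returns the shared verdict, or null if the variants disagree.
function agreeOn(input) {
  const results = variants.map((re) => re.test(input));
  return results.every((r) => r === results[0]) ? results[0] : null;
}

// Hypothetical sample inputs chosen for illustration.
for (const sample of ["abc123", "abcdef", "1abcde", "ab1", "abcdefg1"]) {
  console.log(sample, agreeOn(sample));
}
```

A `null` from `agreeOn` would mean the engine running the script treats the variants differently, which is the symptom to look for.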
I first tried the same approach as in my
original attempt at dealing with the bug, but no luck. I tried using
the + quantifier; it still didn't work. Now the person posting the
question stated that without the lookahead the remaining pattern worked
fine, excluding the at-least-one-digit test. So I began testing
parts of the pattern, such as
^[a-z][a-z0-9]{5,7}$
^(?=\D*\d)[a-z]
^(?=\D+\d)[a-z]
These are just a few of the attempts; all work as
expected. It was only when I put the whole pattern back together
that it began failing. But after some more trial and error I think
I've come across a pattern to the bug. Going back to my original
examination of the problem, I discovered that this pattern ^(?=\D*\d)[a-z][a-z0-9]{2,7}$ matches
the test string “abc123”. Now this doesn't seem to quite fit with
my original assessment of what was happening, because in that earlier pattern
the part after the lookahead was just testing for a certain number of
any characters, whereas in this pattern it's looking for a specific range
of characters. But if you up the min boundary on the last quantifier by one, to
^(?=\D*\d)[a-z][a-z0-9]{3,7}$, the pattern fails again. Now if you
change the zero-to-infinity quantifier in the lookahead to a one-to-infinity quantifier in
the modified pattern, so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$, the pattern
matches again. Bump up the min boundary of this new pattern to
^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and, bugaboo! The pattern fails again.
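For comparison, the sketch below runs the same four patterns against “abc123” in the engine executing the script and lists the result next to the outcome reported above for VBScript/JScript. The `reportedBuggy` values are simply transcribed from the text, not measured by the script, and the field names are made up for illustration.

```javascript
// Results for "abc123". `reportedBuggy` transcribes the outcomes described
// in the text above; it is not measured by this script.
const testString = "abc123";
const cases = [
  { re: /^(?=\D*\d)[a-z][a-z0-9]{2,7}$/, reportedBuggy: true },
  { re: /^(?=\D*\d)[a-z][a-z0-9]{3,7}$/, reportedBuggy: false },
  { re: /^(?=\D+\d)[a-z][a-z0-9]{3,7}$/, reportedBuggy: true },
  { re: /^(?=\D+\d)[a-z][a-z0-9]{4,7}$/, reportedBuggy: false },
];

const comparison = cases.map(({ re, reportedBuggy }) => ({
  pattern: re.source,
  thisEngine: re.test(testString), // verdict of the engine running this script
  reportedBuggy,
}));

console.table(comparison);
```

In an engine without the bug, every `thisEngine` entry should be true, since “abc123” satisfies all four patterns as written.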
Now I don't have access to the source
code, so this is just supposition, but here's what I believe is
happening. I think the data examined by the lookahead is being
stored in a stack structure. Just looking at the patterns that
are working and failing, it looks like once the lookahead is satisfied,
when a quantifier is encountered in the consuming portion of
the pattern, the lookahead's match is reconsidered, with the minimum
boundary of the quantifier in the lookahead popped off the stack of
the lookahead's match. Let's look at the first pattern that worked
^(?=\D*\d)[a-z][a-z0-9]{2,7}$
OK, we'll start after the lookahead
is matched. Now the lookahead is supposed to be non-consuming, so the
pointer should still be at the beginning of the string.
^(?=\D*\d) matches “abc1”. The
rest of the pattern matches normally until we get to the quantifier in
the consuming portion of the pattern, where we are looking for at least two
alphanumeric characters. At that point the lookahead match is
reconsidered; the lower bound being zero, nothing is removed from the
stack, but the current character pointer now points to the character
after the lookahead's match value of “abc1”. The quantified part of
the pattern, [a-z0-9]{2,7}$, can be satisfied by “23”, so we get a
match.
Now if we do the same thing with
^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic, the regex
fails because the quantifier in the consuming portion is looking for at
least 3 characters, and there aren't that many if it tries to satisfy
that part of the pattern after “abc1” in the test string.
Now let's look at
^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only
difference from the previous pattern is the lower boundary of the
lookahead quantifier. It's now 1. So if we pop 1 character off the
stack of the lookahead's match of “abc1” we get “abc”, leaving
us with “123” to be matched by the consuming quantifier, which is
just enough.
Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and
apply the same logic. Now the pattern fails because the consuming
quantifier is looking for at least 4 characters after the stack popping
of the lookahead's match.
If you changed the lookahead quantifier
to {2,} the pattern would match. You can continue upping the minimum of the quantifier in the consuming portion by one to make the pattern fail, then upping the one in the non-consuming part to get the match back, so the behavior seems pretty
consistent with my theory, which is that the consuming quantifier is pointing
to the end of the lookahead match and moving back the number of
characters given by the lookahead's minimum boundary. It also seems to explain
the effect of my original encounter with the bug. It may as well explain
why the workaround of testing the length first with another lookahead avoids
the bug, because the consuming quantifier in that case is usually just
*, though + seems to work too, which doesn't quite fit; but consuming
quantifiers with minimum boundaries of zero or one don't seem to be
affected in any case. In all of the above test cases those values
were below the minimal threshold of every successful test. However,
the test cases above have only one lookahead, and it doesn't backtrack;
who knows what influence backtracking, additional lookaheads, or
additional quantifiers in the consuming part of the pattern would
have. Now while the test values support my theory, I can't say for
sure that things are happening exactly the way I've laid out. The
actual mechanics may be different, but whatever is happening under the
hood, clearly pointers are being corrupted and the regex engine is
losing its place.
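The theory can be written down as a toy model to check it against the results above. This is only a sketch of the supposition in the text, not the real engine's internals; the function name and parameters are invented for illustration, and it only covers the single-lookahead, anchored patterns tested here.

```javascript
// Toy model of the theory above, NOT the real engine internals.
//   input          - the whole test string, e.g. "abc123"
//   lookaheadMatch - text the lookahead matched, e.g. "abc1"
//   lookaheadMin   - minimum of the quantifier inside the lookahead
//   consumingMin / consumingMax - bounds of the trailing quantifier
function predictBuggyMatch(input, lookaheadMatch, lookaheadMin, consumingMin, consumingMax) {
  // Theory: the consuming quantifier starts at the end of the lookahead's
  // match, moved back by the lookahead quantifier's minimum.
  const start = lookaheadMatch.length - lookaheadMin;
  // The pattern is anchored with $, so everything from `start` to the end
  // of the string has to fit inside the trailing quantifier's bounds.
  const available = input.length - start;
  return available >= consumingMin && available <= consumingMax;
}

// The cases walked through above, all against "abc123" / "abc1".
console.log(predictBuggyMatch("abc123", "abc1", 0, 2, 7)); // \D*  with {2,7}
console.log(predictBuggyMatch("abc123", "abc1", 0, 3, 7)); // \D*  with {3,7}
console.log(predictBuggyMatch("abc123", "abc1", 1, 3, 7)); // \D+  with {3,7}
console.log(predictBuggyMatch("abc123", "abc1", 1, 4, 7)); // \D+  with {4,7}
console.log(predictBuggyMatch("abc123", "abc1", 2, 4, 7)); // \D{2,} with {4,7}
```

If the model holds, its predictions should line up with the match/fail results described above; any pattern where it disagrees with the real engine would be a good test case against the theory.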
The thing that was so confusing about
this was that it only kicks in when a quantifier in the consuming portion
is encountered, but if there was something between the lookahead and
the quantified portion, that matched normally. So this is something it's
really hard to make test cases for, because you get these ghost values
popping up later on in the test than I expected. Not to mention the
pattern itself is correct, so even with a tool that will let you step
through the matching you don't get this behavior unless that tool is
using VBScript's regex engine, and I haven't seen such a tool for that
engine.
If you are using lookaheads with
JavaScript client-side then you are going to be susceptible to the
bug in IE, because it will use the JScript engine. And while you
should always validate server-side, if VBScript or JScript is your
server-side language you are still at risk. So a platform like
classic ASP, which uses both of those languages by default, is at risk
client- and server-side, but a platform like PHP, while it still suffers
the bug client-side in IE, should work correctly server-side, which
is using a different regex engine. The same goes for non-web clients
using the JScript/VBScript DLL.
The workaround for the strong-password
type of regex, when using lookaheads, is to include the upper-boundary
test(s) as lookaheads before the tests with no upper boundary, then use .* to consume
the characters. The bounded test should keep the pattern from running
forever. However, depending on the complexity of your criteria this
may not always be an option, but try it first anyway.
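Here is one way the workaround might look for the two patterns from this post. The length-first rewrites are my reading of the description above, not patterns taken from the original question, and the sample inputs and `sameVerdict` helper are made up. The script only checks that each rewrite agrees with its original in the engine running it; whether a rewrite really sidesteps the bug still has to be tested in IE or classic ASP.

```javascript
// Originals from the post, with the length checked after the lookaheads.
const strongOriginal = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/;
const simpleOriginal = /^(?=.*?\d)[a-z][a-z0-9]{5,7}$/;

// Length-first rewrites (an assumption about how the workaround is applied):
// the bounded test moves into the first lookahead and .* does the consuming.
const strongLengthFirst = /^(?=.{8,15}$)(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).*$/;
const simpleLengthFirst = /^(?=[a-z][a-z0-9]{5,7}$)(?=\D*\d).*$/;

// True when both patterns give the same answer for every input.
function sameVerdict(a, b, inputs) {
  return inputs.every((s) => a.test(s) === b.test(s));
}

// Hypothetical sample inputs chosen for illustration.
const strongSamples = ["Passw0rdX", "password1", "Ab1", "ABCDEFGH1", "Abcdefgh1jklmnopq"];
const simpleSamples = ["abc123", "abcdef", "1abcde", "ab1", "abcdefgh1"];

console.log(sameVerdict(strongOriginal, strongLengthFirst, strongSamples));
console.log(sameVerdict(simpleOriginal, simpleLengthFirst, simpleSamples));
```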
|
-
First off, let me say I'm a bit over my
head here. Not the regex part, but the host language of the regex engine.
Many moons ago I posted a blog article
stating why you could not write a regex that validated an e-mail
address 100%. Well, this is still true. However, in that
post I also stated that the pattern was so massive that it wasn't
worth using. This is also still true, but I was made aware of a
flavor-specific syntax that reduces the regex from massive to very
large.
This regex is for the PCRE engine.
http://www.myregextester.com/?r=337
From what I've read this will work for PHP too. Now, I don't know Perl or PHP, or what
minimum version of PCRE supports this syntax. That being the case, I
also don't know how well it performs. I wrote the original version using
the .NET syntax, and not only was the regex massive, which is one reason I never posted it, but the
performance was terrible. Given that most people want to use this
type of regex to validate a data entry field, the pattern was overkill.
In fact I recommend that you don't use this, except to learn from.
The PCRE version may perform better, but I don't have the means or
time to test, so use at your own risk. For simple field validation
even this is still overkill. For a large text file performance
may suffer horribly. Most likely you aren't going to want to use
this pattern, as it is too large for simple tests and performs poorly
on large text.
When I see people asking for an email
regex, I point out that perfect validation is not possible. And when
I see so-called email-validating regexes that are only about 50
characters long, it makes me chuckle. This pattern is probably the
most compact version of an RFC 2822 address regex you'll find, and it
is still huge. Ports to other regex engines not supporting the
recursive syntax will easily be 4x as large as my .NET version was.
The above pattern covers the RFC spec up to
the addr-spec, which is pretty much what people are thinking about
when they say email address.
It's not too hard to take it up a few
more levels in the spec using this syntax.
RFC 2822 mailbox:
http://www.myregextester.com/?r=338
But like I said, it likely won't perform
well enough to be useful. I've wrapped the two patterns I've linked to
in anchors, so they only match against the whole
string. Searching for an address within a larger body of text, without anchors, will probably
degrade performance very fast. But if any of you PHP or Perl gurus want to stress test this beast, have fun. Maybe it's not as bad as I think it may be.
|
-
This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code
I got a little carried away with ideas
for this. They were all regex based, which really is what motivated me
to work on it. However, after I thought I was done I learned not
everything worked. It did what I wanted it to do, but what I wanted
wasn't the correct thing. I really should have just stopped with my
original ideas.
The last idea for my original changes was to take two or more
individual subset properties and write them in the shorthand notation of
the main property they were a subset of. Well, I got that working.
But upon testing I learned something about CSS that I didn't
know: what I was doing could alter the behavior of the
presentation. That was disappointing, because I put a lot of energy
into getting the results I was after.
So it looked as if all of that code was
going to go to waste. But there was one scenario where what I was
trying to do was all right, so the code wasn't completely wasted. The
one scenario is when all the subset properties are declared; then
combining them is fine. I didn't bother changing the regexes I wrote
for this, but I cleaned up some of the code. Though it would have
worked as is, some of the things being checked were now unnecessary.
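To illustrate the "only when all the subset properties are declared" rule on its own, here is a minimal JavaScript sketch for the top/right/bottom/left case. It is not a port of the C# that follows; the `collapseBox` name, the value pattern, and the sample blocks are all made up for illustration.

```javascript
// Combine <property>-top/right/bottom/left into one shorthand declaration,
// but only when all four longhands are present in the block.
function collapseBox(block, property) {
  const re = new RegExp(
    property + "-(top|right|bottom|left)\\s*:\\s*([\\w.%-]+);?",
    "gi"
  );
  const sides = {};
  for (const m of block.matchAll(re)) {
    sides[m[1].toLowerCase()] = m[2]; // a repeated side keeps its last value
  }
  const order = ["top", "right", "bottom", "left"];
  // A partial set is left alone: writing the shorthand would reset the
  // undeclared sides and could change how the page renders.
  if (!order.every((side) => side in sides)) {
    return block;
  }
  const shorthand = property + ":" + order.map((side) => sides[side]).join(" ") + ";";
  // Drop the longhands and put the shorthand right after the opening brace.
  return block.replace(re, "").replace("{", "{" + shorthand);
}

console.log(collapseBox("{margin-top:1px;margin-right:2px;margin-bottom:3px;margin-left:4px;color:red;}", "margin"));
console.log(collapseBox("{margin-top:1px;margin-left:4px;}", "margin"));
```

The second call returns its input unchanged, which is the behavior that keeps the minifier from altering the presentation.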
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
namespace CSSMinify
{
class CSSMinify
{
public static Hashtable shortColorNames = new Hashtable();
public static Hashtable shortHexColors = new Hashtable();
public static string Minify(string css)
{
return Minify(css, 0);
}
public static string Minify(string css, int columnWidth)
{
// BSD License http://developer.yahoo.net/yui/license.txt
// New css tests and regexes by Michael Ash
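// Fill the lookup tables only once; Hashtable.Add throws on duplicate keys if Minify is called again.
if (shortColorNames.Count == 0)
{
createHashTable();
}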
MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
css = RemoveCommentBlocks(css);
css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
/* Remove the spaces before the things that should not have spaces before them.
But, be careful not to turn "p :link {...}" into "p:link{...}"
*/
css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
//New test
//Font weights
css = Regex.Replace(css, @"(?<=font-weight:)normal\b", "400");
css = Regex.Replace(css, @"(?<=font-weight:)bold\b", "700");
//Thought this was a good idea, but properties of a set that are not defined get element defaults, so this was resetting them.
//css = ShortHandProperty(css);
css = ShortHandAllProperties(css);
//css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
// if all 4 parameters the same unit make 1 parameter
css = Regex.Replace(css, @"(?<!background-position\s*):\s*(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
css = Regex.Replace(css, @":\s*((inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
css = Regex.Replace(css, @":\s*((?:(?:inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(inherit|auto|0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
//// if has 3 parameters and top unit = bottom unit make 2 parameters
//css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, "background-position:0;", "background-position:0 0;");
css = Regex.Replace(css, @"(:|\s)0+\.(\d+)", "$1.$2");
// Outline-style and Border-style parameter reduction
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
// Outline-color and Border-color parameter reduction
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
// Shorten colors from rgb(51,102,153) to #336699
// This makes it more likely that it'll get further compressed in the next step.
css = Regex.Replace(css, @"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
// Replace hex color code with the named value if the name is shorter
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate, RegexOptions.IgnoreCase);
// Remove empty rules.
css = Regex.Replace(css, @"[^}]+{;}", "");
//Remove semicolon of last property
css = Regex.Replace(css, ";(})", "$1");
if (columnWidth > 0)
{
css = BreakLines(css, columnWidth);
}
return css;
}
private static string RemoveCommentBlocks(string input)
{
int startIndex = 0;
int endIndex = 0;
bool iemac = false;
startIndex = input.IndexOf(@"/*", startIndex);
while (startIndex >= 0)
{
endIndex = input.IndexOf(@"*/", startIndex + 2);
if (endIndex >= startIndex + 2)
{
if (input[endIndex - 1] == '\\')
{
startIndex = endIndex + 2;
iemac = true;
}
else if (iemac)
{
startIndex = endIndex + 2;
iemac = false;
}
else
{
input = input.Remove(startIndex, endIndex + 2 - startIndex);
}
}
startIndex = input.IndexOf(@"/*", startIndex);
}
return input;
}
private static String RGBMatchHandler(Match m)
{
int val = 0;
StringBuilder hexcolor = new StringBuilder("#");
for (int index = 1; index <= 3; index += 1)
{
val = Int32.Parse(m.Groups[index].Value);
hexcolor.Append(val.ToString("x2"));
}
return hexcolor.ToString();
}
private static string BreakLines(string css, int columnWidth)
{
int i = 0;
int start = 0;
StringBuilder sb = new StringBuilder(css);
while (i < sb.Length)
{
char c = sb[i++];
if (c == '}' && i - start > columnWidth)
{
sb.Insert(i, '\n');
start = i;
}
}
return sb.ToString();
}
private static string ReplaceNonEmpty(string inputText, string replacementText)
{
if (replacementText.Trim() != string.Empty)
{
inputText = string.Format(" {0}", replacementText);
}
return inputText;
}
private static string ShortColorNameMatchHandler(Match m)
{
// This function replaces hex color values with named colors if the name is shorter than the hex code
string returnValue = m.Value;
if (shortColorNames.ContainsKey(m.Groups["hex"].Value))
{
returnValue = shortColorNames[m.Groups["hex"].Value].ToString();
}
return returnValue;
}
private static string ShortColorHexMatchHandler(Match m)
{
//This function replaces named values with their shorter hex equivalent
return shortHexColors[m.Value.ToString().ToLower()].ToString();
}
private static void createHashTable()
{
//Color names shorter than hex notation. Except for red.
shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
shortColorNames.Add("008000".ToLower(), "Green".ToLower());
shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());
// Hex notation shorter than named value
shortHexColors.Add("black", "#000");
shortHexColors.Add("fuchsia", "#f0f");
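// Keys must be all lower case: the lookup in ShortColorHexMatchHandler lower-cases the match first.
shortHexColors.Add("lightslategray", "#789");
shortHexColors.Add("lightslategrey", "#789");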
shortHexColors.Add("magenta", "#f0f");
shortHexColors.Add("white", "#fff");
shortHexColors.Add("yellow", "#ff0");
}
private static string ShortHandAllProperties(string css)
{
/*
* This function searches for blocks that specify all the individual properties of a property type
* and reduces them to a single property using shorthand notation
*/
Regex reCSSBlock = new Regex("{[^{}]*}");
Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?", RegexOptions.IgnoreCase);
Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
Regex reListStyle = new Regex(@"list-style-(?<style>type|image|position)\s*:\s*(?<unit>[^};]+);?", RegexOptions.IgnoreCase);
Regex reFont = new Regex(@"font-(?:(?:(?<fontProperty>family\b)\s*:\s*(?<fontPropertyValue>(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22)(?:\s*,\s*(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22))*)\b)|
(?:(?<fontProperty>style\b)\s*:\s*(?<fontPropertyValue>normal|italic|oblique|inherit))|
(?:(?<fontProperty>variant\b)\s*:\s*(?<fontPropertyValue>normal|small-caps|inherit))|
(?:(?<fontProperty>weight\b)\s*:\s*(?<fontPropertyValue>normal|bold|(?:bold|light)er|[1-9]00|inherit))|
(?:(?<fontProperty>size\b)\s*:\s*(?<fontPropertyValue>(?:(?:xx?-)?(?:small|large))|medium|(?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex))\b)|inherit|\b0\b)))\s*;?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace));
Regex reBackGround = new Regex(@"background-(?:
(?:(?<property>color)\s*:\s*(?<unit>transparent|inherit|(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)))|
(?:(?<property>image)\s*:\s*(?<unit>none|inherit|(?:url\s*\([^()]+\))))|
(?:(?<property>repeat)\s*:\s*(?<unit>no-repeat|inherit|repeat(?:-[xy])))|
(?:(?<property>attachment)\s*:\s*(?<unit>scroll|inherit|fixed))|
(?:(?<property>position)\s*:\s*(?<unit>((?<horizontal>left | center | right|(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+(?<vertical>top | center | bottom |(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))))|
((?<vertical>top | center | bottom )\s+(?<horizontal>left | center | right ))|
((?<horizontal>left | center | right )|(?<vertical>top | center | bottom ))))
);?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture));
MatchCollection mcBlocks = reCSSBlock.Matches(css);
foreach (Match mBlock in mcBlocks)
{
string strBlock = mBlock.Value;
HasAllPositions(reTRBL1, ref strBlock);
HasAllPositions(reTRBL2, ref strBlock);
HasAllPositions(reTRBL3, ref strBlock);
HasAllPositions(reTRBL4, ref strBlock);
HasAllPositions(reTRBL5, ref strBlock);
HasAllListStyle(reListStyle, ref strBlock);
HasAllFontProperties(reFont, ref strBlock);
HasAllBackGroundProperties(reBackGround, ref strBlock);
css = css.Replace(mBlock.Value, strBlock);
}
return css;
}
private static void HasAllBackGroundProperties(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
int z = 5;
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["property"].Value)
{
case "color":
y = y + 1;
break;
case "image":
y = y + 2;
break;
case "repeat":
y = y + 4;
break;
case "attachment":
y = y + 8;
break;
case "position":
y = y + 16;
break;
}
}
if (y == 31)
{
CSSText = ShortHandBackGroundReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllFontProperties(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
int z = 5;
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["fontProperty"].Value)
{
case "style":
y = y + 1;
break;
case "variant":
y = y + 2;
break;
case "weight":
y = y + 4;
break;
case "size":
y = y + 8;
break;
case "family":
y = y + 16;
break;
}
}
if (y == 31)
{
CSSText = ShortHandFontReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllListStyle(Regex re, ref string CSSText)
{
{
int z = 3;
MatchCollection mcProperySet = re.Matches(CSSText);
if (mcProperySet.Count == z)
{
int y = 0;
for (int x = 0; x < z; x = x + 1)
{
switch (mcProperySet[x].Groups["style"].Value)
{
case "type":
y = y + 1;
break;
case "image":
y = y + 2;
break;
case "position":
y = y + 4;
break;
}
}
if (y == 7)
{
CSSText = ShortHandListReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static void HasAllPositions(Regex re, ref string CSSText)
{
{
MatchCollection mcProperySet = re.Matches(CSSText);
if (mcProperySet.Count == 4)
{
int y = 0;
for (int x = 0; x < 4; x = x + 1)
{
switch (mcProperySet[x].Groups["position"].Value)
{
case "top":
y = y + 1;
break;
case "right":
y = y + 2;
break;
case "bottom":
y = y + 4;
break;
case "left":
y = y + 8;
break;
}
}
if (y == 15)
{
CSSText = ShortHandReplaceV2(mcProperySet, re, CSSText);
}
}
}
}
private static string ShortHandFontReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual font properties with a single entry
* */
string strFamily, strStyle, strVariant, strWeight, strSize;
Regex reLineHeight = new Regex(@"line-height\s*:\s*((?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex)\b)?)|normal|inherit);?", RegexOptions.IgnoreCase);
strFamily = string.Empty;
strStyle = string.Empty;
strVariant = string.Empty;
strWeight = string.Empty;
strSize = string.Empty;
string strStyle_Variant_Weight = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["fontProperty"].Value)
{
case "family":
strFamily = string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
break;
case "size":
if (reLineHeight.IsMatch(InputText))
{
Match m = reLineHeight.Match(InputText);
if (m.Groups[1].Value != "normal")
{
strSize = String.Format("/{0}", m.Groups[1].Value);
}
InputText = reLineHeight.Replace(InputText, string.Empty);
}
strSize = string.Format(" {0}{1}", mProperty.Groups["fontPropertyValue"].Value, strSize);
if (strSize == "medium")
{
strSize = string.Empty;
}
break;
case "style":
case "variant":
case "weight":
if (mProperty.Groups["fontPropertyValue"].Value != "normal")
{
strStyle_Variant_Weight += string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
} break;
}
}
string strShortcut;
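// All five arguments need a placeholder, otherwise the size and family are silently dropped.
string strProperties = string.Format("{0}{1}{2}{3}{4};", strStyle_Variant_Weight, strVariant, strWeight, strSize, strFamily);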
strShortcut = string.Format("font:{0}", strProperties.Trim());
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandBackGroundReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual background properties with a single entry
* */
string strColor, strImage, strRepeat, strAttachment, strPosition;
strColor = string.Empty;
strImage = string.Empty;
strRepeat = string.Empty;
strAttachment = string.Empty;
strPosition = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["property"].Value)
{
case "color":
if (mProperty.Groups["unit"].Value != "transparent")
{
strColor = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "image":
if (mProperty.Groups["unit"].Value != "none")
{
strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "repeat":
if (mProperty.Groups["unit"].Value != "repeat")
{
strRepeat = string.Format(" {0}", mProperty.Groups["unit"].Value);
} break;
case "attachment":
if (mProperty.Groups["unit"].Value != "scroll")
{
strAttachment = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "position":
if (mProperty.Groups["unit"].Value != "0% 0%")
{
strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
}
}
string strShortcut;
string strProperties = string.Format("{0}{1}{2}{3}{4};", strColor, strImage, strRepeat, strAttachment, strPosition);
strShortcut = string.Format("background:{0}", strProperties.Trim());
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandReplaceV2(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
{
// Replace method for regexes used in ShortHand property method for properties with top, right, bottom and left sub properties.
string strTop, strRight, strBottom, strLeft;
strTop = string.Empty;
strRight = string.Empty;
strBottom = string.Empty;
strLeft = string.Empty;
string strProperty;
strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["position"].Value)
{
case "top":
strTop = mProperty.Groups["unit"].Value;
break;
case "right":
strRight = mProperty.Groups["unit"].Value;
break;
case "bottom":
strBottom = mProperty.Groups["unit"].Value;
break;
case "left":
strLeft = mProperty.Groups["unit"].Value;
break;
}
}
string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
string strNewBlock = reTRBL1.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
private static string ShortHandListReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
{
/*
* This Function replaces the individual list properties with a single entry
* */
string strType, strPosition, strImage;
strType = string.Empty;
strPosition = string.Empty;
strImage = string.Empty;
foreach (Match mProperty in mcProperySet)
{
switch (mProperty.Groups["style"].Value)
{
case "type":
if (mProperty.Groups["unit"].Value != "disc")
{
strType = mProperty.Groups["unit"].Value;
}
break;
case "position":
if (mProperty.Groups["unit"].Value != "outside")
{
strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
case "image":
if (mProperty.Groups["unit"].Value != "none")
{
strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
}
break;
}
}
string strShortcut = string.Format("list-style:{0}{1}{2};", strType, strPosition, strImage);
string strNewBlock = re.Replace(InputText, "");
strNewBlock = strNewBlock.Insert(1, strShortcut);
return strNewBlock;
}
}
}
|
-
OK, these regexes were discussed in the previous post; this is mostly just their application.
This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code.
Since I was doing this in C#, I took full advantage of its regex engine, namely using lookbehinds and delegates for some replaces.
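To make the lookbehind point concrete, here is a minimal sketch that runs the "no preceding space needed" pattern from the listing on its own. The class and method names (LookbehindDemo, StripLeadingSpaces) are made up for illustration and are not part of the minifier. The unbounded lookbehind `(?<={[^{}]*)` is what lets the colon be treated differently inside a declaration block than in a selector.

```csharp
using System;
using System.Text.RegularExpressions;

class LookbehindDemo
{
    // Hypothetical wrapper around the pattern used in Minify.
    // A space before ':' is removed only when the colon sits between
    // '{' and '}', which the lookbehind and lookahead check.
    public static string StripLeadingSpaces(string css)
    {
        return Regex.Replace(css,
            @"\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
    }

    static void Main()
    {
        // The selector "p :link" keeps its space; "color :" loses it.
        Console.WriteLine(StripLeadingSpaces("p :link { color : red }"));
    }
}
```

Note that this only removes the spaces before those characters; the spaces after them are handled by a separate replace in the listing.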
Almost all the regexes after the "New test" comment are new or modified regexes from the ported version. There are also one new and two modified expressions before that comment. One of those modifications is just a change in writing style; the other replaces some code, but (hopefully) not functionality, with a regex replace. The new regex replacements, of course, are the new compression enhancements.
There are also a couple of new regexes not mentioned in the previous post that match some of the color values and replace them with an equivalent but more concisely written value. The replace for the color "red" is a straight replace, but the other colors require some code evaluation and use delegates.
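As a rough illustration of a replace that needs code evaluation, the sketch below uses a MatchEvaluator delegate to turn rgb(51,102,153) into #336699. The handler here (RgbToHex) and the simplified pattern are assumptions for the example; they are not the RGBMatchHandler or the exact regex from the source.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

class RgbToHexDemo
{
    // Hypothetical handler: called once per match, returns the replacement text.
    static string RgbToHex(Match m)
    {
        int r = int.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture);
        int g = int.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);
        int b = int.Parse(m.Groups[3].Value, CultureInfo.InvariantCulture);
        if (r > 255 || g > 255 || b > 255)
        {
            return m.Value; // out of range, leave the original text alone
        }
        return string.Format("#{0:x2}{1:x2}{2:x2}", r, g, b);
    }

    public static string ReplaceRgb(string css)
    {
        // Simplified pattern for the sketch; it accepts any 1 to 3 digit
        // number and leaves the range check to the handler.
        return Regex.Replace(css,
            @"rgb\s*\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)",
            new MatchEvaluator(RgbToHex));
    }

    static void Main()
    {
        Console.WriteLine(ReplaceRgb("a{color:rgb(51,102,153);}"));
    }
}
```

A plain replacement string can't do the decimal-to-hex conversion, which is why a delegate is the natural fit for this kind of replace.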
I've done some very limited testing, but as I mentioned in the previous post, most of the CSS I've written doesn't have some of the new things I was searching for. I could add them for a test (which I did), but that won't catch any problems they may cause in the actual CSS application, since I wasn't really using the test values. So the source code is now available for beta testing. Test early and often before committing to use it. I'm willing to fix any minor bugs for things I may have overlooked, but if a particular replace is problematic, it's easy enough to comment out the offender and use the rest.
And as was mentioned in the comments of the previous post, any generated content that looks like CSS may get stepped on, so be aware of that.
Also, all licenses for previous versions still apply.
UPDATE 2008-04-27
After a little more testing I discovered that one of the replaces I was doing can alter how the CSS is processed. So I have crossed out the functions and the function call. I've come up with a safer replacement, though one that is less likely to occur.
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;
namespace CSSMinify
{
class CSSMinify
{
public static Hashtable shortColorNames = new Hashtable();
public static Hashtable shortHexColors = new Hashtable();
public static string Minify(string css)
{
return Minify(css, 0);
}
public static string Minify(string css, int columnWidth)
{
// BSD License http://developer.yahoo.net/yui/license.txt
// New css tests and regexes by Michael Ash
createHashTable();
MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
css = RemoveCommentBlocks(css);
css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
/* Remove the spaces before the things that should not have spaces before them.
But, be careful not to turn "p :link {...}" into "p:link{...}"
*/
css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
//New test
css = ShortHandProperty(css);
//css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
// if all 4 parameters the same unit make 1 parameter
css = Regex.Replace(css, @":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
// if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
css = Regex.Replace(css, @":\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
//// if has 3 parameters and top unit = bottom unit make 2 parameters
//css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])| (?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
css = Regex.Replace(css,"background-position:0;", "background-position:0 0;");
css = Regex.Replace(css,@"(:|\s)0+\.(\d+)", "$1.$2");
// Outline-style and border-style parameter reduction
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
// Outline-color and Border-color parameter reduction
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
// Shorten colors from rgb(51,102,153) to #336699
// This makes it more likely that it'll get further compressed in the next step.
css = Regex.Replace(css,@"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
// Replace the hex color code with the named value if it is shorter
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red",RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate,RegexOptions.IgnoreCase);
// Remove empty rules.
css = Regex.Replace(css,@"[^}]+{;}", "");
//Remove semicolon of last property
css = Regex.Replace(css, ";(})", "$1");
if (columnWidth > 0)
{
css = BreakLines(css, columnWidth);
}
return css;
}
private static string RemoveCommentBlocks(string input)
{
int startIndex = 0;
int endIndex = 0;
bool iemac = false;
startIndex = input.IndexOf(@"/*", startIndex);
while (startIndex >= 0)
{
endIndex = input.IndexOf(@"*/", startIndex + 2);
if (endIndex >= startIndex + 2)
{
if (input[endIndex - 1] == '\\')
{
startIndex = endIndex + 2;
iemac = true;
}