Michael Ash's Regex Blog

Regex Musings

  • Full Circle - JavaScript Date validation with regex

    Several years ago I decided to learn regular expressions. I had been skimming an article about using regex for validation on web pages, and it was at least a couple of weeks before I sat down and read it in full. After I did, I wanted to try to write a regex on my own but couldn't think of anything original. Eventually I wrote a date regex that validated leap years as well, something none of the other regexes I had seen at the time did. In the spirit of the article that introduced regular expressions to me, I thought this was something that could be used for web page validation. But at the time I wasn't doing a whole lot of web page development, and what little I was doing didn't require me to use or validate dates.

    I shared my regex on the web and got a lot of requests to do other date formats. At first I was reluctant, because my original pattern was troublesome to write, mostly because the text editor I was using didn't help me match parentheses and I had a bunch of typos to work out. Plus I always thought it would be easier to reformat the date into the format the regex was checking.

    Eventually someone else reverse engineered my regex into another format. Later I wrote patterns for other formats as I saw uses beyond web page validation, and I expanded the patterns as I learned more about regex. For various reasons I eventually moved away from working on these types of patterns; I had done about all I could with them. Other than fixing the occasional bug, generally caused by a typo, I didn't do much with them.

    Recently I got another request for an alternate format for one of my old patterns. I referred the requester to my central depot of date regexes. But something else in the request reminded me of my original idea of handling other formats by reformatting the date, so I decided to actually do it. Here it is in JavaScript:

     

    function isValidDate(ds, format) {
        var reDate = /^(?!(?:10([-./])(?:0?[5-9]|1[0-4])\1(?:1582)))(?:(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\2|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\3))(?:(?!0000)\d{4})$|^(?:0?2(\/|-|\.)29\4(?:(?:(?:(?!000[04]|(?:(?:1[^0-6]|[2468][^048]|[3579][^26])00))(?:(?:(?:\d\d)(?:[02468][048]|[13579][26]))))|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\5(?:(?!0000)\d{4}))$/;
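        switch (format) {
            case "dmy":
                // Swap day and month so the string is in month-day-year order.
                return reDate.test(ds.toString().replace(/(\d?\d)(\D)(\d?\d)(\D)(\d{4})/, "$3$2$1$4$5"));
            case "ymd":
                // Move the year to the end so the string is in month-day-year order.
                return reDate.test(ds.toString().replace(/(\d{4})(\D)(\d?\d)(\D)(\d?\d)/, "$3$2$5$4$1"));
            default:
                // No format given: the string is tested as month-day-year.
                return reDate.test(ds);
        }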
    }
    

     

    That's it. The function returns true if the string passed is a valid date and false otherwise.

    isValidDate("11/1/2011")  // true

    isValidDate("31/10/2011") // false

    isValidDate("31/10/2011","dmy") // true

    isValidDate("2031/10/20","ymd") // true 

    The big regex is the date validator. I've talked about it enough in the past, so I'm not going to go into detail about how it works; just trust me that it does. The basics are four-digit years and one- or two-digit months and days. You have a choice of three date separators: a period (.), a hyphen (-) or a forward slash (/). Whichever one you choose, you have to be consistent. There have been a few modifications from my original date regex to expand the time span it checks against and to exclude the days dropped in the switch from the Julian to the Gregorian calendar in 1582, but most apps won't push up against the old boundaries.

    The default format checked is mm/dd/yyyy.

    The function takes an optional second parameter that lets you specify the other formats, yyyy-mm-dd or dd-mm-yyyy, by passing “ymd” or “dmy” respectively.

    If the second argument is passed, the given alternate format is converted to the default format for testing by one of two simple regex replace operations. Nothing really fancy: the month, day and year are moved around to get to mm-dd-yyyy format.
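
    As a rough illustration of that reordering step on its own, here is a small sketch. The helper name toMonthDayYear is hypothetical and is not part of the function above; it only reuses the same two replace patterns so the intermediate string can be inspected.

    ```javascript
    // Hypothetical helper (not in the original function): reorder a date
    // string into month-day-year order with the same two replace patterns.
    function toMonthDayYear(ds, format) {
        const s = String(ds);
        switch (format) {
            case "dmy":
                // day, sep, month, sep, year -> month, sep, day, sep, year
                return s.replace(/(\d?\d)(\D)(\d?\d)(\D)(\d{4})/, "$3$2$1$4$5");
            case "ymd":
                // year, sep, month, sep, day -> month, sep, day, sep, year
                return s.replace(/(\d{4})(\D)(\d?\d)(\D)(\d?\d)/, "$3$2$5$4$1");
            default:
                // Already assumed to be month-day-year.
                return s;
        }
    }

    console.log(toMonthDayYear("31/10/2011", "dmy")); // "10/31/2011"
    console.log(toMonthDayYear("2031/10/20", "ymd")); // "10/20/2031"
    console.log(toMonthDayYear("11/1/2011"));         // "11/1/2011"
    ```

    Note that the separators ($2 and $4) keep their positions, so an input with mixed separators stays mixed after the reorder and should still be rejected by the main pattern's consistency check.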

    For this I just kept it simple and handled the most common numerical date formats. For general web page validation this function is perfectly viable. You could have taken my date regexes for all the various formats and used each one based on the format. The advantage of using just one date regex is that maintenance is easier. Not that the pattern needs any maintenance, but if it did there is only one complex regex to deal with instead of three. Also, with one regex you can play around with other checks by using the match. I did that while playing around with this but am not including it here. And while writing this I thought of a couple of other things I could build off of this.

    Anyway, there you have it. Three regexes, one scary and two simple, and a few lines of code to validate dates in three formats.


  • Got (X)HTML? Use the DOM



    By far the most common request I see from people wanting regex help is for a regex to parse HTML. Generally I ignore those questions. If I do respond, my response is “Don’t use a regex to parse HTML. Use the HTML DOM.” The same is true for XML, with the XML DOM, but doing it for HTML is even worse. Not that my advice stops people from trying to do it anyway, but having given them proper warning, I feel no remorse over the self-inflicted pain they incur by ignoring my advice.

    You may ask why I advise people against using a regex for this task. First, full disclosure: I have used, do use, and will use regexes to parse HTML. But this is a case of “Do as I say, not as I do.” Why is it OK for me but not for you, you may ask? Simple: I am very knowledgeable in both regex and HTML. If I do embark on such a perilous task, I know exactly what I’m getting into, what the risks are and what pitfalls to expect. In most cases I am the author of the HTML, so I know how the markup will be constructed. Most people asking for a regex to parse HTML are expert in neither and have no idea of what to prepare for. Which is reason one I tell people not to use regex for HTML: it is not as simple as you want it to be. Most people knowledgeable in regex will tell you to never use a regex for HTML. I won’t go that far, but if someone needs help to construct such a regex then they shouldn’t be using a regex for this task. If you can write it yourself, you still shouldn’t do it, but you are the only one who has to deal with the pitfalls, so you are the only one who has to suffer.

    Now, I’ve seen other articles advising against using regex on HTML, so I never really considered writing about it. I thought it had been covered enough. But since people keep asking for these types of patterns, I thought I should address it as well. I also have insight that most of those other articles don’t address.

    The biggest issue by far is that the people asking for these patterns, as well as most of those who try to write patterns for them, don’t know HTML very well. HTML is one of those languages a lot of people think they know better than they actually do, and that’s where the problems start. For the last few years I've been learning about Web Standards, which in a nutshell includes writing valid HTML and following best practices of web development. One of the things I learned in doing this is that most HTML in the real world is poorly written. This compounds the problem of using regex against HTML, since regexes are all about finding patterns. So even if you know the rules of valid HTML, you have no guarantee the HTML you are working with follows those rules.

    One of the Posting Guidelines is to show the original text and not a made-up sample. Although you shouldn’t be trying to use a regex on HTML, it illustrates why this guideline is in place. The majority of people asking this type of question will either make up the sample HTML or show a simplified version of the markup, highlighting the beginning and the end of what they want to match and writing something like “blah blah blah” for the content in between. The problem here is a lack of understanding of how regex works and how HTML can be written.

    Commonly the request is to match a certain div or table. The problem here is the assumption that these elements won’t be nested with child elements of the same type. A table can contain other tables, which is very common in pages that use tables for layout. And just about any page that has div elements will contain nested divs.
    Often the request is “I want to match a particular div,” accompanied by some phony HTML like

    <div id="findMe">
        blah

    blah

    blah

    </div>

    And some kind soul may write them a regex pattern that matches that sample. The problem is that real HTML could break any pattern they write. The person asking for help incorrectly assumes that a regex will understand HTML and know that the </div> is the closing tag of <div id="findMe">. The person answering may think that all that is needed, after finding the opening div, is to search until the closing div occurs. This shows how made-up data and regex construction don’t go together. The pattern to find a block with no nested elements is simple. The pattern to find one that has nested elements gets significantly harder the deeper the nesting. The “blah blah blah” could easily be more markup, including any number of divs. There could easily be eight </div> tags before the one that closes <div id="findMe">.
    If the actual source is

    <div id="findMe">
        <div>content <div> More content</div> continued </div>

    <div>content</div>

    <div>content More content</div>

    </div>
    most regexes will choke trying to correctly find findMe’s closing tag, especially if the nesting depth is unknown.
    And while a powerful regex engine may be able to balance the tags, that doesn’t mean you are in the clear.
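
    To make the nesting problem concrete, here is a minimal sketch. The sample string and both patterns are made up for illustration; they stand in for the kind of “search until the closing div” pattern described above.

    ```javascript
    // Made-up sample: the target div contains one nested div and is
    // followed by an unrelated sibling div.
    const sampleHtml =
        '<div id="findMe"><div>inner</div> tail</div><div>other</div>';

    // Lazy version: stop at the first </div> after the opening tag.
    const lazyPattern = /<div id="findMe">[\s\S]*?<\/div>/;
    // Greedy version: run to the last </div> in the input.
    const greedyPattern = /<div id="findMe">[\s\S]*<\/div>/;

    const lazyMatch = sampleHtml.match(lazyPattern)[0];
    const greedyMatch = sampleHtml.match(greedyPattern)[0];

    console.log(lazyMatch);
    // '<div id="findMe"><div>inner</div>' (cut off at the inner closing tag)
    console.log(greedyMatch === sampleHtml);
    // true (swallows the unrelated sibling div as well)
    ```

    Neither pattern returns the actual findMe element: one stops too early and the other runs too far.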

    Of course the other assumption that leads to trouble is that the HTML is either well written or well formed. Most HTML out in the wild is not well written, in that it wouldn’t pass validation. Run most pages through the W3C’s validator at http://validator.w3.org/ and don’t be shocked if the error count is in triple digits. Unlike a web browser, a regex isn’t an HTML parser, so it isn’t going to catch any HTML errors, any of which may cause your pattern to fail. A web browser is very fault tolerant, so it can handle errors and adjust. A regex won’t.
    <div id="findMe">
        blah

    blah

    blah

    <div>

    Oops forgot to close the previous Div

    </div>

    So even if you managed to write a pattern that could handle nested tags of the same type, which is tough enough in itself, you now have to account for unclosed tags as well. And that is just one of many very likely errors you’d need to account for.

    Even if the HTML is well written and passes validation, that doesn’t mean it is well formed. Unlike XML or XHTML, HTML doesn’t have to be well formed to be valid. The assumption by many is that every opening tag has a matching closing tag. Sorry, that simply isn’t the case. Several tags have optional closing tags. While the above snippet isn’t valid, the snippet below is:

    <p id="findMe">
        blah

    blah

    <p> More stuff</p>

    Some may think they can just test for the first occurrence of either </p> or <p> after <p id="findMe"> to find the complete <p id="findMe"> element, but what if the markup is like so

    <p id="findMe">
        <span>blah</span>

    blah

    <ul>
        <li>
        <li><p> More Stuff</p>
    </ul>
    <p> More stuff</p>

    or for a nested table

    <table id="findMe">
        <tbody>
            <tr><td><td><table>
                        <tr><td><td>

    <tr><td><td></tr>

                         </table>
            <tr><td><td>
            <tr><td><td>
    </table>

    The above samples represent source that could actually exist when attempting to find the element with the id of “findMe”.

    Formatting also has to be considered. What is the difference between these two snippets of markup?
    <p class=demo id=test> Some Text</p>
    and
    <p
     id="test"
     class="demo"
    >
    Some Text
    </p>

    As far as a browser is concerned they are the same. A regex would see them as two different strings.

    Another common thing I see is that when people want to use a regex against HTML, they want to find an element based on an attribute. Several incorrect assumptions are generally made:
    1. The attribute they are looking for will always be first
    2. The order of the attributes is fixed
    3. The attributes are always quoted
    4. All attributes are assigned a value


    None of the above has to be true for valid HTML.
    Take the following and assume one wanted to find all the elements with a findMe class

    <p class="findMe">
        blah

    blah

    <p> More stuff</p>
    <p class="hello findMe world">
        blah
        <span class='findMe also'>blah</span>
    <p> More stuff</p>
    <p class=findMe>
        blah

    blah

    </p>
    <p class="can you findMe"> More stuff</p>
    <p> More stuff</p>

    Trying to find each element with findMe in its class with a regex is way more trouble than it’s worth.
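
    As a quick sketch of how easily a simple attribute pattern goes wrong, the following runs one obvious-looking regex against the opening tags from the sample above. The pattern is made up for illustration and is an assumption about what a first attempt might look like.

    ```javascript
    // Opening tags from the sample above that carry the findMe class.
    const openingTags = [
        '<p class="findMe">',
        '<p class="hello findMe world">',
        "<span class='findMe also'>",
        '<p class=findMe>',
        '<p class="can you findMe">',
    ];

    // A typical first attempt: assumes double quotes and a single class name.
    const naiveClassPattern = /class="findMe"/;

    const hits = openingTags.filter((tag) => naiveClassPattern.test(tag));
    console.log(hits.length + " of " + openingTags.length); // "1 of 5"
    ```

    Only the first tag is found. Each of the other four needs its own special case (multiple class names, single quotes, no quotes), and that is before attribute order or line breaks inside the tag are considered.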

    Now, these examples are contrived and you most likely won’t get such a mix of styles. But you could get one that you weren’t expecting. And if your HTML was created by a team of developers, you may get that mix after all. And remember, you can’t even assume the HTML is correctly written. So the moment you notice you’ve missed a style you didn’t account for, you have to adjust your pattern, which could be a huge nightmare.

    Another thing to consider, especially when searching by attributes, is that a regex can’t reuse consumed text. Consider the following
    <div class="findMe">
        <p>Blah</p>
        <p class="findMe"> Some text <b class="findMe">to find</b></p>
        <p>Blah</p>
    </div>
    <div> Blah Blah Blah</div>

    Again, suppose the task is to find all elements with a findMe class. Let’s assume a regex can match the div element. Once this is done, the regex engine will proceed to look for the next match after the div element’s closing tag. It will not go back inside the div to find the class on the elements within the div.
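
    Here is a small sketch of that behavior using the sample above flattened into one string. The pattern is a made-up illustration that happens to work for the outer div in this particular sample; it is not a recommended way to match elements.

    ```javascript
    // The sample markup from above, flattened into a single string.
    const nestedHtml =
        '<div class="findMe"><p>Blah</p>' +
        '<p class="findMe"> Some text <b class="findMe">to find</b></p>' +
        '<p>Blah</p></div><div> Blah Blah Blah</div>';

    // Match an element with class="findMe" through its closing tag.
    // The backreference \1 reuses the captured tag name.
    const elementPattern = /<(\w+) class="findMe">[\s\S]*?<\/\1>/g;

    const found = nestedHtml.match(elementPattern);
    const classCount = nestedHtml.match(/class="findMe"/g).length;

    console.log(classCount);   // 3 elements carry the class
    console.log(found.length); // 1 match: the outer div consumed the other two
    ```

    The p and b elements sit inside text that the first match already consumed, so a global search resumes after the closing </div> and never sees them.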


    Instead of trying to hit a moving target with a regex, the HTML DOM would parse all the above samples, even the incorrect ones, because that's its job. Of course, if there are errors the document may not be parsed exactly as you intended, but it will be parsed consistently. From there it is fairly easy to find the element you need. Reformatting the markup is very unlikely to break code using the DOM.

    For example, using the popular JavaScript library jQuery, the following snippet would find all the elements with a findMe class.

    var foundThem = $(".findMe");


    Or, in environments supporting getElementsByClassName

    var foundThem = document.getElementsByClassName('findMe');


    And quickly, on XML: with its stricter syntax rules (well formed, all attributes quoted) it may seem more regex friendly, but there are better ways to parse XML. XPath and XQuery are both more maintainable than a regex would be for this task, and easier to use and understand in most cases.

    You can use a regex, which may work some of the time, or the DOM, which should work all the time. In the end the choice is yours.

  • Grading the Guidelines.

    Over at the Construction forum there is a sticky post which lists the suggested posting guidelines for asking questions on that forum. First let me say the guidelines were my idea; however, I didn't come up with them on my own. I had help. The idea sprang from the fact that I was getting tired of repeatedly having to ask each poster for the same bits of information. So the idea was to put these standard questions in a central, highly visible location, where new posters could see what would be asked of them and add that information to their first post. The standard formula was


    P1: Ask vague question

    ME: Ask for more info

    P1: Provide more info (maybe)

    Me: Provide answer (maybe)


    That was the best case, but it often didn't go this smoothly. Variations would be

    Ask for two pieces of additional information, get one (or none)

    Answer question, get response of “Oh I forgot about this” or “but what about this unmentioned case”

    where steps two and three could repeat multiple times.


    The point of the posting guidelines was to eliminate the extra back and forth so you get down to

    P1: Ask detailed question

    Me: Answer

    P1: Thanks (maybe)


    Now if I were to grade how successful the posting guidelines have been, I would have to give two grades.

    Grade 1 (Posters asking questions): D-


    Unsatisfactory, but passing. For the most part people asking the questions won't follow the posting guidelines, so in those cases nothing really changes. You still end up going back and forth until you get the information you need to properly answer the question. This is a waste of everyone's time, mostly that of the person asking the question. For those asking, it just delays getting an answer, especially if the person responding is on a different sleep cycle than you. You ask a question; hours later someone on the other side of the planet, starting their day, responds “Need more information”. However, you've signed off for the day. So the next day you respond, but now the other person has signed off for the day (or weekend). So instead of having an answer waiting for you, you end up waiting several more hours or days.


    In some cases getting the asked-for information is like pulling teeth. I've always thought it was strange how some people are so unwilling to help you help them. Maybe it's because misery loves company and they want you to struggle with an answer in the same way they have been struggling with the problem. Sometimes, even after being referred to the guidelines, people will respond but still not provide any of the information asked for. Sometimes they just rephrase their original query but add no new information, or at least nothing satisfying any of the guideline requests.


    Some even get hostile about it. One person stated he couldn't answer any of the guideline questions because he was new to regex and didn't know anything about it. Now that wasn't really a valid excuse, since the questions don't require any regex knowledge. The only one that even implies you have any regex knowledge is “Show what you have tried”, which implies you have attempted to write a pattern. But if you haven't tried anything, you can reply “I haven't tried anything”.


    A lot of people will post questions with the mindset that you will answer while sitting in their lap. One of the reasons we are asking for this info is that we cannot see what you are doing. We are not physically in the room with you. So statements like “I have a text file” or “I'm using a text editor ...” aren't helpful. We are not there; we can't see your file, and we can't see your editor, which we may never even have heard of.


    People will often post that they are new to regex. Sometimes I think they feel this excuses them from needing to follow the guidelines. In fact it's just the opposite. They are the ones who need to follow the guidelines most closely. We've all been in situations where we understood something so little that we didn't even know what questions to ask. Or, in this case, what information is important and what is too much information. The guidelines cover the most crucial information needed. The more of this info you provide, the easier your question becomes to answer. The less you provide, the more difficult it becomes, regardless of whether you are a regex noob or a vet.


    The other thing that I find annoying is people asking for help while saying it's not important for the people answering to know what language they are using, or that it doesn't matter what language or platform is used to supply an answer. Why do you think we would ask for unimportant information? The same is true for all the guideline questions. Do those asking the questions think these things were asked just because someone was nosy? Everything asked is so that you can get the best answer as quickly as possible. If you don't understand why the question matters, ask! But don't tell the people trying to help you which parts of the information they are asking for are important.




    I said the guidelines got two grades. The second is for the people actually answering the questions. And for them the guidelines get an A- (exceptional). The guidelines have been around a few years now, since before some of the active members providing the answers joined, but in a lot of cases I see them referring people to the guidelines with no prompting from me. Not everyone answering references the guidelines. Despite the things I just mentioned, it is possible to provide an answer to a question that follows none of the guidelines. It involves guessing and luck. Even if someone asks a vague question, someone else might be able to guess what they actually wanted, answer on that assumption and in some cases be right. But that's mostly luck. Luck that they are familiar with the data you are working with. Just like the “sitting in your lap” case above, people tend to assume that everyone is as familiar with their data or tools as they themselves are. You are lucky if that is true, but nothing guarantees it. Plus a host of other assumptions have to be made and guessed correctly. Do you really want to rely on dumb luck every time?

     

    A simple example

    Q: I need to make sure a string contains only letters, what is the regex for that?

    A: ^[a-zA-Z]*$


    Assumptions made:

    1. By letters the question referred to letters of the English alphabet only

    2. Spaces are not allowed

    Now that pattern may be exactly what the person asking wanted. But what if it wasn't? People asking these questions come from all over the world, and English may not be their first language. The regex may not be intended to be used against English-only input.
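
    A short sketch of how those assumptions play out. The Unicode-aware pattern is my own illustration rather than part of the original answer, and it assumes a JavaScript engine that supports Unicode property escapes (ES2018 or later).

    ```javascript
    // The answer from above: English letters only, no spaces.
    const asciiLetters = /^[a-zA-Z]*$/;
    // Illustrative alternative: any Unicode letter (needs the u flag).
    const unicodeLetters = /^\p{L}*$/u;

    console.log(asciiLetters.test("Hello"));        // true
    console.log(asciiLetters.test("Jos\u00e9"));    // false: é is not in a-z
    console.log(unicodeLetters.test("Jos\u00e9"));  // true
    console.log(asciiLetters.test("two words"));    // false: space not allowed
    console.log(asciiLetters.test(""));             // true: * allows empty input
    ```

    The last line points at a third unstated assumption: with the * quantifier, an empty string also counts as “only letters”.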


    The more common, real-world request I see is when people say they need their regex to match “Special Characters”. Now maybe I was sick the day they announced exactly which characters this term applies to and why they are special, but when people ask for this there doesn't seem to be a consensus on which characters it includes. I could assume they meant any non-alphanumeric character, or even just the non-alphanumeric characters on a keyboard. Of course this assumes we have the same keyboard layout. But often people only want a few of these characters. Which ones? Well, that's part of the problem: it varies. Even if limited to the keyboard, which characters are special and how many of them are wanted differs by person.


    One of the things that really tells me those answering questions are on board with the guidelines is when you see a person who was answering, or trying to answer, questions that provided incomplete information get frustrated by the back and forth and start referring newer posts of that kind to the posting guidelines.


    As new regex pros joined the board and started answering questions, on a couple of occasions I have asked if they had any changes, additions or deletions they thought should be made to the guidelines. The common answer is that the guidelines are good as is. The only real addition suggested was a Frequently Asked Questions (FAQ) list, which, while a good idea since some questions get asked repeatedly, doesn't provide useful information for answering questions. Instead it would be more for preventing questions from being asked again. The only thing is that the questions do vary slightly, like the aforementioned “special characters”, so they would likely still get asked, especially considering the existing guidelines get ignored.


    The only actual change from the first posting of the guidelines is the addition of an example of a badly asked question and a template for a well asked question.


    The actual guidelines can be found here http://regexadvice.com/forums/thread/60465.aspx but a quick summation of the 4 guidelines is


    1. What programming language/Application are you using?

    2. What are you trying to do?

    3. What have you tried to do?

    4. Show the actual text you are working with.


    Now those who say they can't answer any of these questions because they don't know regex are probably about to get fired from their job.

    1. What programming language/Application are you using? - If you are a programmer and you don't know the name of the language you are doing your work in, you are definitely in the wrong job. If you are not a programmer or are just using an application, how can you not know the name of the application you are using or planning to use? How did/will you launch it, or even find it on your computer? How do you know it will even allow you to use a regex?

    2. What are you trying to do? - If you can't answer this, well... You are not being asked about your process, you are being asked for your goal. If you don't know your own goal, why are you even asking others for help?

    3. What have you tried to do? - Again, either you tried something or you didn't try anything. If you have tried something, please show it. If you haven't tried anything, just say so.

    4. Show the actual text you are working with - Aside from security concerns, which can be addressed, this is basically copy and paste.

    Why do we want to know this information?


    1. Regular expressions are not uniform in all cases. There are differences in syntax, in the features offered and even in how features work that vary by host environment. And we aren't talking slight variations. In some cases the syntax of one regex engine is over 50% different from another. Not all patterns are portable. Knowing the hosting environment makes it possible to know what is available regex-wise.

    2. Knowing the goal makes giving an answer easier, if there is an answer. In some cases the goal may not be achievable with a regex, or at all. Some solutions require multiple patterns, some require programming code. Some goals are better suited to a non-regex solution.

    3. Samples of what has been tried provide insight into the host environment and into the poster's level of knowledge of regex and of that environment. Code samples show the language being used. Some attempts show a misunderstanding of regex. Some attempts show syntax or logic errors in the host language code. Some people were close to getting the pattern right themselves. Others have had a perfectly fine regex pattern but their host language code was wrong, so there was nothing in the regex that needed fixing; their code needed to be corrected. Or the pattern itself is fine but needs to be adjusted for the host language.

    4. This is very important but I have already covered that in another post.


    Generally people incorrectly assume that if they make their question as minimal as possible it helps to make it easier to answer. As far as regex goes the opposite is true. The more details, the easier regex questions are to answer. Minimal regex questions often lead to more questions. Detailed regex questions often lead to quick answers.


  • Looking again at the Lookahead bug

    A few years ago I wrote about a bug in Internet Explorer's regex engine that affected patterns with lookaheads. Well, the bug came back in the form of a question on RegexAdvice.com. It too involved a password regex, though not as complex as the previous pattern that introduced me to this bug.

    The first pattern had three conditions that were being tested for with lookaheads.

    ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$

    With the current pattern only one lookahead was being used.

    ^(?=.*?\d)[a-z][a-z0-9]{5,7}$

    Both patterns had a min and max length. In the original attempts of both, the length was being checked after the lookahead test(s). While this is perfectly fine in a non-VBScript/JScript world, this is where the bug kicks in for those regex engines. Actually it's probably the same regex engine for both languages, which is probably why it only affects IE: it's the only browser that uses VBScript and JScript natively. I don't recall whether I tested server-side in my original testing for this bug; given my previous blog comments, most likely I only tested client-side. However, in the recent question it was failing server-side, so the bug is not actually in the browser but more likely in the DLL for those languages. Anyway, the previous blog article covered the behavior that was happening.

    Steve Levithan looked much closer at the problem in general and discussed it on his blog. He came to the conclusion that quantifiers with a minimum boundary of zero within the lookahead were the culprit. He provides a couple of simple examples. I think he's partially right, but I don't think the one-or-more quantifiers are excluded from the problem. I think his examples were a little too simple for them to be affected.

    OK, let's look at the regex pattern in the recent question

    ^(?=.*?\d)[a-z][a-z0-9]{5,7}$

    The requirements were a 6 to 8 character alphanumeric (English) string that starts with an alpha character and contains at least one digit. Let's use “abc123” as the test string.

    Now the above pattern was supplied by one of the resident pros who frequent the message boards. Let's get past the fact that the pattern itself is correct and satisfies the requirements. I'm going to rewrite the pattern to

    ^(?=\D*\d)[a-z][a-z0-9]{5,7}$

    Now this is functionally the same; it's just that within the lookahead the pattern is greedy instead of lazy, which will make the point I'm trying to make easier to see (I hope) without having to deal with backtracking. This pattern suffers from the bug in VBScript/JScript. Now Steven suggested the one-or-more quantifier (+) doesn't suffer from the problem, so change the * to a +. This doesn't affect the match, because the first test after the lookahead is for an alpha, so the lookahead placed as it is will always match at least one character with the \D* pattern. So now we have

    ^(?=\D+\d)[a-z][a-z0-9]{5,7}$

    which is also bitten by the bug.

    I first tried the same approach as in my original attempt at dealing with the bug, but no luck. I tried using the + quantifier; it still didn't work. Now the person posting the question stated that without the lookahead the remaining pattern worked fine, excluding the at-least-one-digit test. So I began testing parts of the pattern, such as

    ^[a-z][a-z0-9]{5,7}$

    ^(?=\D*\d)[a-z]

    ^(?=\D+\d)[a-z]

    These are just a few attempts, and all work as expected. It was only when I put the whole pattern back together that it began failing. But after some more trial and error I think I've come across a pattern to the bug. Going back to my original examination of the problem, I discovered that the pattern ^(?=\D*\d)[a-z][a-z0-9]{2,7}$ matches the test string “abc123”. This doesn't seem to quite fit with my original assessment of what was happening, because in that earlier pattern the part after the lookahead just tests for a certain number of any characters, while in this pattern it's looking for a specific range of characters. But if you up the min boundary on the last quantifier by one, to ^(?=\D*\d)[a-z][a-z0-9]{3,7}$, the pattern fails again. Now if you change the zero-to-infinity quantifier in the lookahead of the modified pattern to a one-to-infinity quantifier, so you get ^(?=\D+\d)[a-z][a-z0-9]{3,7}$, the pattern matches again. Bump up the min boundary of this new pattern to ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and, bugaboo! The pattern fails again.
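
    For comparison, here is a small sketch that runs the same variants on a standards-conforming engine (for example a current browser or Node.js, which is an assumption about where you run it). There every variant should match “abc123”, since the consuming part [a-z][a-z0-9]{n,7} has six characters available no matter what the lookahead inspected. The failures described above are specific to the old VBScript/JScript engine.

    ```javascript
    // The pattern variants discussed above.
    const variants = [
        /^(?=.*?\d)[a-z][a-z0-9]{5,7}$/,
        /^(?=\D*\d)[a-z][a-z0-9]{5,7}$/,
        /^(?=\D+\d)[a-z][a-z0-9]{5,7}$/,
        /^(?=\D*\d)[a-z][a-z0-9]{2,7}$/,
        /^(?=\D*\d)[a-z][a-z0-9]{3,7}$/,
        /^(?=\D+\d)[a-z][a-z0-9]{3,7}$/,
        /^(?=\D+\d)[a-z][a-z0-9]{4,7}$/,
    ];

    // On a conforming engine the lookahead consumes nothing, so matching
    // restarts at the beginning of the string for the consuming part.
    const withDigit = variants.map((re) => re.test("abc123"));
    const withoutDigit = variants.map((re) => re.test("abcdef"));

    console.log(withDigit.every(Boolean));  // true: all variants match
    console.log(withoutDigit.some(Boolean)); // false: the digit test rejects
    ```

    A quick check like this is a handy way to tell an engine bug apart from a genuine mistake in the pattern.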

    Now I don't have access to the source code, so this is just supposition, but here's what I believe is happening. I think the data examined by the lookahead is being stored in a stack structure. Just looking at the patterns that are working and failing, it looks like once the lookahead is satisfied and a quantifier is encountered in the consuming portion of the pattern, the lookahead's match is reconsidered, with a number of characters equal to the minimum boundary of the quantifier in the lookahead popped off the stack of the lookahead's match. Let's look at the first pattern that worked

    ^(?=\D*\d)[a-z][a-z0-9]{2,7}$

    OK, we'll start after the lookahead is matched. Now, the lookahead is supposed to be non-consuming, so the pointer should still be at the beginning of the string.

    ^(?=\D*\d) matches “abc1”. Now the rest of the pattern matches normally until we get to the quantifier in the consuming portion of the pattern, where we are looking for at least two alphanumeric characters. At that point the lookahead match is reconsidered. The lower bound being zero, nothing is removed from the stack, but the current character pointer now points to the character after the lookahead's match value of “abc1”. The quantified part of the pattern, [a-z0-9]{2,7}$, can be satisfied by “23”, so we get a match.

    Now if we do the same thing with ^(?=\D*\d)[a-z][a-z0-9]{3,7}$ and apply the same logic, the regex fails because the quantifier in the consuming portion is looking for at least 3 characters, and there aren't that many if it tries to satisfy that part of the pattern after “abc1” in the test string.

    Now let's look at ^(?=\D+\d)[a-z][a-z0-9]{3,7}$ with the same logic. The only difference from the previous pattern is the lower boundary of the lookahead quantifier. It's now 1. So if we pop 1 character off the stack of the lookahead's match of “abc1” we get “abc”, leaving us with “123” to be matched by the consuming quantifier, which is just enough.

    Now take ^(?=\D+\d)[a-z][a-z0-9]{4,7}$ and apply the same logic. This time the pattern fails, because the consuming quantifier is looking for at least 4 characters after the stack popping of the lookahead's match.

    If you changed the lookahead quantifier to {2,} the pattern would match. You can continue raising the minimum by one in the consuming portion to make the pattern fail, then in the non-consuming part of the pattern to get the match back, so the behavior seems pretty consistent with my theory, which is that the consuming quantifier starts at the end of the lookahead match and moves back by the number of characters in the lookahead's minimum boundary. It also seems to explain the effect of my original encounter with the bug. It may also explain why the workaround of testing the length first with another lookahead avoids the bug, because the consuming quantifier in that case is usually just *. Though + seems to work too, which doesn't quite fit, consuming quantifiers with minimum boundaries of zero or one don't seem to be affected in any case. In all of the above test cases those values were below the minimum threshold of every successful test. However, the test cases above have only one lookahead, and it doesn't backtrack; who knows what influence backtracking or additional lookaheads would have, or how additional quantifiers in the consuming parts of the pattern would be affected. Now, while the test values support my theory, I can't say for sure that things are happening exactly the way I've laid out. The actual mechanics may be different, but whatever is happening under the hood, clearly pointers are being corrupted and the regex engine is losing its place.
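
    The theory can be written down as a tiny model and checked against the results reported above. This is only a sketch of the supposition, not of the real engine's internals: the function predictBuggyMatch and its parameter names are hypothetical, and it looks only at minimum boundaries, ignoring upper bounds, backtracking and multiple lookaheads.

```javascript
// Hypothetical model of the theory above, NOT of any engine's actual code.
//   input          : the whole test string, e.g. "abc123"
//   lookaheadMatch : the text the lookahead matched, e.g. "abc1"
//   lookaheadMin   : minimum of the quantifier inside the lookahead (0 for *, 1 for +)
//   consumingMin   : minimum of the quantifier in the consuming part (the 2 in {2,7})
function predictBuggyMatch(input, lookaheadMatch, lookaheadMin, consumingMin) {
  // Theory: the consuming quantifier starts at the end of the lookahead's
  // match, moved back by the lookahead quantifier's minimum boundary.
  const start = lookaheadMatch.length - lookaheadMin;
  const remaining = input.length - start;
  return remaining >= consumingMin;
}

console.log(predictBuggyMatch("abc123", "abc1", 0, 2)); // true:  \D* with {2,7}
console.log(predictBuggyMatch("abc123", "abc1", 0, 3)); // false: \D* with {3,7}
console.log(predictBuggyMatch("abc123", "abc1", 1, 3)); // true:  \D+ with {3,7}
console.log(predictBuggyMatch("abc123", "abc1", 1, 4)); // false: \D+ with {4,7}
console.log(predictBuggyMatch("abc123", "abc1", 2, 4)); // true:  \D{2,} with {4,7}
```

    The five predictions line up with the five outcomes described in the text, which is all the model claims to do.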

    The thing that was so confusing about this was that it only kicks in when a quantifier in the consuming portion is encountered, but if there is something between the lookahead and the quantified portion it matches normally. So this is something it's really hard to make test cases for, because you get these ghost values popping up later in the test than you'd expect. Not to mention the pattern itself is correct, so even with a tool that lets you step through the matching you don't see this behavior unless that tool is using VBScript's regex engine, and I haven't seen such a tool for that engine.

    If you are using lookaheads with JavaScript client-side then you are going to be susceptible to the bug in IE, because it will use the JScript engine. And while you should always validate server-side, if VBScript or JScript is your server-side language you are still at risk. So a platform like classic ASP, which uses both of those languages by default, is at risk on both the client and the server side, but a platform like PHP, while it still suffers the bug client-side in IE, should work correctly server-side, where it uses a different regex engine. The same risk applies to non-web clients using the JScript/VBScript DLL.

    The workaround for the strong-password type of regex, when using lookaheads, is to include the upper-boundary test(s) before the test with no upper boundary, then use .* to consume characters. The bounded test should keep the pattern from running forever. However, depending on the complexity of your criteria this may not always be an option, but try it first anyway.
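
    One way to read that workaround, using the example from this post: move the bounded test into a lookahead of its own, put it first, follow it with the unbounded digit test, and let .* do the consuming. The exact pattern below is an assumption built from that description (the names workaround and isValid are illustrative), and it has not been verified here on the affected JScript/VBScript engines; the sketch only shows that the rearranged pattern still enforces the same rules on a conforming engine.

```javascript
// Bounded test first:  (?=[a-z][a-z0-9]{5,7}$)  a letter, then 5 to 7 letters/digits
// Unbounded test next: (?=\D*\d)                 at least one digit somewhere
// Consuming part:      .*                        minimum boundary of zero
const workaround = /^(?=[a-z][a-z0-9]{5,7}$)(?=\D*\d).*$/;

function isValid(value) {
  return workaround.test(value);
}

console.log(isValid("abc123"));    // true:  6 characters, starts with a letter, has a digit
console.log(isValid("abcdef"));    // false: no digit
console.log(isValid("ab1"));       // false: too short
console.log(isValid("1abcde"));    // false: starts with a digit
console.log(isValid("abcdefgh1")); // false: too long
```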

  • Validating Email Revisited

    First off, let me say I'm a bit over my head here. Not the regex part, but the host language of the regex engine.

    Many moons ago I posted a blog article stating why you could not write a regex that validated an e-mail address 100%. Well, this is still true. However, in that post I also stated that the pattern was so massive that it wasn't worth using. This is also still true; however, I was made aware of a flavor-specific syntax that reduces the regex from massive to very large.

    This regex is for the PCRE engine. http://www.myregextester.com/?r=337

    Though from what I've read this will work for PHP too. Now, I don't know Perl or PHP, or what minimum version of PCRE supports this syntax. That being the case, I also don't know how well it performs. I wrote the original version using the .NET syntax, and not only was the regex massive, which is one reason I never posted it, but the performance was terrible. Given that most people want to use this type of regex to validate a data entry field, the pattern was overkill. In fact I recommend that you don't use this, except to learn from. The PCRE version may perform better, but I don't have the means or time to test it, so use at your own risk. For simple field validation even this is still overkill. For a large text file performance may suffer horribly. Most likely you aren't going to want to use this pattern, as it is too large for a simple test and performs poorly on large text.

    When I see people asking for an email regex, I point out that perfect validation is not possible. And when I see so-called email-validating regexes that are only about 50 characters long, it makes me chuckle. This pattern is probably the most compact version of an RFC 2822 address regex you'll find, and it is still huge. Ports to other regex engines not supporting the recursive syntax will easily be 4x as large as my .NET version was.

    The above pattern covers the RFC spec up to the addr-spec, which is pretty much what people are thinking about when they say email address.

    It's not too hard to take it up a few more levels in the spec using this syntax

    RFC 2822 mailbox : http://www.myregextester.com/?r=338

    but like I said, it likely won't perform well enough to be useful. The two patterns I've linked to are wrapped in anchors, so they only match against the whole string. Searching for a string within a larger body, without anchors, will probably degrade performance very fast. But if any of you PHP or Perl gurus want to stress test this beast, have fun. Maybe it's not as bad as I think it may be.




  • Update to CSS Minification

    This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code.

    I got a little carried away with ideas for this. They were all regex-based, which really is what motivated me to work on it. However, after I thought I was done I learned that not everything worked. It did what I wanted it to do, but what I wanted wasn't the correct thing. I really should have just stopped with my original ideas.

    The last idea for my original changes was to take 2 or more individual subset properties and write them in the shorthand notation of the main property they were a subset of. Well, I got that working. But upon testing I learned something new about CSS that I didn't know: basically, that what I was doing could alter the behavior of the presentation. That was disappointing, because I put a lot of energy into getting the results I was after.

    So it looked as if all of that code was going to go to waste. But there was one scenario where what I was trying to do was all right, so the code wasn't completely wasted: if all the subset properties are declared, then combining them is fine. I didn't bother changing the regexes I wrote for this, but I cleaned up some of the code. Though it would have worked as is, some of the things being checked were now unnecessary.
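
    The "only when every subset property is declared" rule can be shown in a few lines. This is a simplified sketch in JavaScript rather than the C# of the listing, covering padding only; combinePadding is a hypothetical helper, not part of the minifier. A block that declares only some of the sides is returned untouched, because the shorthand would reset the sides that were not declared.

```javascript
// Hypothetical helper: collapse padding-top/right/bottom/left into `padding`,
// but only when all four sides are declared inside the block.
function combinePadding(block) {
  const re = /padding-(top|right|bottom|left)\s*:\s*([\w.]+);?/gi;
  const sides = {};
  for (const m of block.matchAll(re)) {
    sides[m[1].toLowerCase()] = m[2]; // if a side repeats, the last value wins
  }
  const order = ["top", "right", "bottom", "left"];
  if (!order.every((side) => side in sides)) {
    return block; // a side is missing, so combining could change the presentation
  }
  const shorthand = "padding:" + order.map((side) => sides[side]).join(" ") + ";";
  let inserted = false;
  // Put the shorthand where the first longhand was and drop the other three.
  return block.replace(re, () => {
    if (inserted) return "";
    inserted = true;
    return shorthand;
  });
}

console.log(combinePadding("{padding-top:1px;padding-right:2px;padding-bottom:3px;padding-left:4px;}"));
// {padding:1px 2px 3px 4px;}
console.log(combinePadding("{padding-top:1px;padding-left:4px;}"));
// {padding-top:1px;padding-left:4px;}
```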

     

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace CSSMinify
    {
        class CSSMinify
        {
            public static Hashtable shortColorNames = new Hashtable();
            public static Hashtable shortHexColors = new Hashtable();
            public static string Minify(string css)
            {
                return Minify(css, 0);
            }
            public static string Minify(string css, int columnWidth)
            {
                // BSD License http://developer.yahoo.net/yui/license.txt
                // New css tests and regexes by Michael Ash

                createHashTable();
                MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
                MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
                MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
                css = RemoveCommentBlocks(css);
                css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
                css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
                /* Remove the spaces before the things that should not have spaces before them.
                   But, be careful not to turn "p :link {...}" into "p:link{...}"
                */
                css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
                css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1");  // Remove the spaces after the things that should not have spaces after them.
                css = Regex.Replace(css, @"([^;}])}", "$1;}");    // Add the semicolon where it's missing.
                css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
                css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
                //New test
                //Font weights
                css = Regex.Replace(css, @"(?<=font-weight:)normal\b", "400");
                css = Regex.Replace(css, @"(?<=font-weight:)bold\b", "700");
                //Thought this was a good idea, but properties of a set that are not defined get the element defaults, so this was resetting them. css = ShortHandProperty(css);
                css = ShortHandAllProperties(css);
                //css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
                // if all 4 parameters the same unit make 1 parameter
                css = Regex.Replace(css, @"(?<!background-position\s*):\s*(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
                // if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
                css = Regex.Replace(css, @":\s*((inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
                // if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
                css = Regex.Replace(css, @":\s*((?:(?:inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(inherit|auto|0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
                //// if has 3 parameters and top unit = bottom unit make 2 parameters
                //css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
                css = Regex.Replace(css, "background-position:0;", "background-position:0 0;");
                css = Regex.Replace(css, @"(:|\s)0+\.(\d+)", "$1.$2");
                //  Outline-style and Border-style parameter reduction
                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);

                //  Outline-color and Border-color parameter reduction
                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);

                css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);

                // Shorten colors from rgb(51,102,153) to #336699
                // This makes it more likely that it'll get further compressed in the next step.
                css = Regex.Replace(css, @"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
                css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
                // Replace hex color code with named value if the name is shorter
                css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red", RegexOptions.IgnoreCase);
                css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
                css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate, RegexOptions.IgnoreCase);

                // Remove empty rules.
                css = Regex.Replace(css, @"[^}]+{;}", "");
                //Remove semicolon of last property
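                css = Regex.Replace(css, ";(})", "$1");
                // Restore the box model hack that was hidden before the other replacements ran
                css = css.Replace("___PSEUDOCLASSBMH___", "\"\\\"}\\\"\"");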
                if (columnWidth > 0)
                {
                    css = BreakLines(css, columnWidth);
                }
                return css;
            }
            private static string RemoveCommentBlocks(string input)
            {
                int startIndex = 0;
                int endIndex = 0;
                bool iemac = false;
                startIndex = input.IndexOf(@"/*", startIndex);
                while (startIndex >= 0)
                {
                    endIndex = input.IndexOf(@"*/", startIndex + 2);
                    if (endIndex >= startIndex + 2)
                    {
                        if (input[endIndex - 1] == '\\')
                        {
                            startIndex = endIndex + 2;
                            iemac = true;
                        }
                        else if (iemac)
                        {
                            startIndex = endIndex + 2;
                            iemac = false;
                        }
                        else
                        {
                            input = input.Remove(startIndex, endIndex + 2 - startIndex);
                        }
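                    }
                    else
                    {
                        // Unterminated comment: stop, otherwise IndexOf keeps finding the same "/*" forever
                        break;
                    }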
                    startIndex = input.IndexOf(@"/*", startIndex);
                }
                return input;
            }
            private static String RGBMatchHandler(Match m)
            {
                int val = 0;
                StringBuilder hexcolor = new StringBuilder("#");
                for (int index = 1; index <= 3; index += 1)
                {
                    val = Int32.Parse(m.Groups[index].Value);
                    hexcolor.Append(val.ToString("x2"));
                }
                return hexcolor.ToString();
            }
            private static string BreakLines(string css, int columnWidth)
            {
                int i = 0;
                int start = 0;
                StringBuilder sb = new StringBuilder(css);
                while (i < sb.Length)
                {
                    char c = sb[i++];
                    if (c == '}' && i - start > columnWidth)
                    {
                        sb.Insert(i, '\n');
                        start = i;
                    }
                }
                return sb.ToString();

            }
            private static string ReplaceNonEmpty(string inputText, string replacementText)
            {
                if (replacementText.Trim() != string.Empty)
                {
                    inputText = string.Format(" {0}", replacementText);
                }
                return inputText;
            }
            private static string ShortColorNameMatchHandler(Match m)
            {
                // This function replaces hex color values with named colors if the name is shorter than the hex code
                string returnValue = m.Value;
                if (shortColorNames.ContainsKey(m.Groups["hex"].Value))
                {
                    returnValue = shortColorNames[m.Groups["hex"].Value].ToString();
                }
                return returnValue;
            }
            private static string ShortColorHexMatchHandler(Match m)
            {
                //This function replaces named values with their shorter hex equivalent
                return shortHexColors[m.Value.ToString().ToLower()].ToString();
            }
            private static void createHashTable()
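            {
                // The tables are static, so fill them only once; adding the same key a second time throws
                if (shortColorNames.Count > 0)
                {
                    return;
                }
                //Color names shorter than hex notation. Except for red.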
                shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
                shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
                shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
                shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
                shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
                shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
                shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
                shortColorNames.Add("008000".ToLower(), "Green".ToLower());
                shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
                shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
                shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
                shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
                shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
                shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
                shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
                shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
                shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
                shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
                shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
                shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
                shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
                shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
                shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
                shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
                shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
                shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
                shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
                shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
                shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
                shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());

                // Hex notation shorter than named value
                shortHexColors.Add("black", "#000");
                shortHexColors.Add("fuchsia", "#f0f");
                // Keys must be all lowercase because the lookup lowercases the matched name
                shortHexColors.Add("lightslategray", "#789");
                shortHexColors.Add("lightslategrey", "#789");
                shortHexColors.Add("magenta", "#f0f");
                shortHexColors.Add("white", "#fff");
                shortHexColors.Add("yellow", "#ff0");
            }
            private static string ShortHandAllProperties(string css)
            {
                /*
                 * This function searches for blocks that specify all the individual properties of a property type
                 * and reduces them to a single property using shorthand notation
                 */
                Regex reCSSBlock = new Regex("{[^{}]*}");
                Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
                Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?", RegexOptions.IgnoreCase);
                Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
                Regex reListStyle = new Regex(@"list-style-(?<style>type|image|position)\s*:\s*(?<unit>[^};]+);?", RegexOptions.IgnoreCase);
                Regex reFont = new Regex(@"font-(?:(?:(?<fontProperty>family\b)\s*:\s*(?<fontPropertyValue>(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22)(?:\s*,\s*(?:\b[a-zA-Z]+(-[a-zA-Z]+)?\b|\x22[^\x22]+\x22))*)\b)|
    (?:(?<fontProperty>style\b)\s*:\s*(?<fontPropertyValue>normal|italic|oblique|inherit))|
    (?:(?<fontProperty>variant\b)\s*:\s*(?<fontPropertyValue>normal|small-caps|inherit))|
    (?:(?<fontProperty>weight\b)\s*:\s*(?<fontPropertyValue>normal|bold|(?:bold|light)er|[1-9]00|inherit))|
    (?:(?<fontProperty>size\b)\s*:\s*(?<fontPropertyValue>(?:(?:xx?-)?(?:small|large))|medium|(?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex))\b)|inherit|\b0\b)))\s*;?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace));
                Regex reBackGround = new Regex(@"background-(?:
    (?:(?<property>color)\s*:\s*(?<unit>transparent|inherit|(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)))|
    (?:(?<property>image)\s*:\s*(?<unit>none|inherit|(?:url\s*\([^()]+\))))|
    (?:(?<property>repeat)\s*:\s*(?<unit>no-repeat|inherit|repeat(?:-[xy])))|
    (?:(?<property>attachment)\s*:\s*(?<unit>scroll|inherit|fixed))|
    (?:(?<property>position)\s*:\s*(?<unit>((?<horizontal>left | center | right|(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+(?<vertical>top | center | bottom |(?:0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))))|
        ((?<vertical>top | center | bottom )\s+(?<horizontal>left | center | right ))|
        ((?<horizontal>left | center | right )|(?<vertical>top | center | bottom ))))
    );?", (RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture));
                MatchCollection mcBlocks = reCSSBlock.Matches(css);
                foreach (Match mBlock in mcBlocks)
                {
                    string strBlock = mBlock.Value;
                    HasAllPositions(reTRBL1, ref strBlock);
                    HasAllPositions(reTRBL2, ref strBlock);
                    HasAllPositions(reTRBL3, ref strBlock);
                    HasAllPositions(reTRBL4, ref strBlock);
                    HasAllPositions(reTRBL5, ref strBlock);
                    HasAllListStyle(reListStyle, ref strBlock);
                    HasAllFontProperties(reFont, ref strBlock);
                    HasAllBackGroundProperties(reBackGround, ref strBlock);
                    css = css.Replace(mBlock.Value, strBlock);
                }
                return css;
            }
            private static void HasAllBackGroundProperties(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    int z = 5;
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["property"].Value)
                            {
                                case "color":
                                    y = y + 1;
                                    break;
                                case "image":
                                    y = y + 2;
                                    break;
                                case "repeat":
                                    y = y + 4;
                                    break;
                                case "attachment":
                                    y = y + 8;
                                    break;
                                case "position":
                                    y = y + 16;
                                    break;
                            }
                        }
                        if (y == 31)
                        {
                            CSSText = ShortHandBackGroundReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllFontProperties(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    int z = 5;
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["fontProperty"].Value)
                            {
                                case "style":
                                    y = y + 1;
                                    break;
                                case "variant":
                                    y = y + 2;
                                    break;
                                case "weight":
                                    y = y + 4;
                                    break;
                                case "size":
                                    y = y + 8;
                                    break;
                                case "family":
                                    y = y + 16;
                                    break;
                            }
                        }
                        if (y == 31)
                        {
                            CSSText = ShortHandFontReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllListStyle(Regex re, ref string CSSText)
            {
                {
                    int z = 3;
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    if (mcProperySet.Count == z)
                    {

                        int y = 0;
                        for (int x = 0; x < z; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["style"].Value)
                            {
                                case "type":
                                    y = y + 1;
                                    break;
                                case "image":
                                    y = y + 2;
                                    break;
                                case "position":
                                    y = y + 4;
                                    break;

                            }
                        }
                        if (y == 7)
                        {
                            CSSText = ShortHandListReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static void HasAllPositions(Regex re, ref string CSSText)
            {
                {
                    MatchCollection mcProperySet = re.Matches(CSSText);
                    if (mcProperySet.Count == 4)
                    {

                        int y = 0;
                        for (int x = 0; x < 4; x = x + 1)
                        {
                            switch (mcProperySet[x].Groups["position"].Value)
                            {
                                case "top":
                                    y = y + 1;
                                    break;
                                case "right":
                                    y = y + 2;
                                    break;
                                case "bottom":
                                    y = y + 4;
                                    break;
                                case "left":
                                    y = y + 8;
                                    break;
                            }
                        }
                        if (y == 15)
                        {
                            CSSText = ShortHandReplaceV2(mcProperySet, re, CSSText);
                        }
                    }
                }
            }
            private static string ShortHandFontReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual font properties with a single entry
                 * */
                string strFamily, strStyle, strVariant, strWeight, strSize;
                Regex reLineHeight = new Regex(@"line-height\s*:\s*((?:\d*\.?\d+(?:%|(p(?:[xct])|(?:[cem])m|in|ex)\b)?)|normal|inherit);?", RegexOptions.IgnoreCase);
                strFamily = string.Empty;
                strStyle = string.Empty;
                strVariant = string.Empty;
                strWeight = string.Empty;
                strSize = string.Empty;
                string strStyle_Variant_Weight = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups[""].Value)
                    {
                        case "family":
                            strFamily = string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
                            break;
                        case "size":
                            if (reLineHeight.IsMatch(InputText))
                            {
                                Match m = reLineHeight.Match(InputText);
                                if (m.Groups[1].Value != "normal")
                                {
                                    strSize = String.Format("/{0}", m.Groups[1].Value);
                                }
                                InputText = reLineHeight.Replace(InputText, string.Empty);
                            }
                            strSize = string.Format(" {0}{1}", mProperty.Groups["fontPropertyValue"].Value, strSize);
                            if (strSize == " medium") // strSize carries the leading space added by the format above
                            {
                                strSize = string.Empty;
                            }
                            break;
                        case "style":
                        case "variant":
                        case "weight":
                            if (mProperty.Groups["fontPropertyValue"].Value != "normal")
                            {
                                strStyle_Variant_Weight += string.Format(" {0}", mProperty.Groups["fontPropertyValue"].Value);
                            }
                            break;

                    }
                }

                string strShortcut;
                string strProperties = string.Format("{0}{1}{2}{3}{4};", strStyle_Variant_Weight, strVariant, strWeight, strSize, strFamily);
                strShortcut = string.Format("font:{0}", strProperties.Trim());
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandBackGroundReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual background properties with a single entry
                 * */
                string strColor, strImage, strRepeat, strAttachment, strPosition;
                strColor = string.Empty;
                strImage = string.Empty;
                strRepeat = string.Empty;
                strAttachment = string.Empty;
                strPosition = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["property"].Value)
                    {
                        case "color":
                            if (mProperty.Groups["unit"].Value != "transparent")
                            {
                                strColor = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "image":
                            if (mProperty.Groups["unit"].Value != "none")
                            {
                                strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "repeat":
                            if (mProperty.Groups["unit"].Value != "repeat")
                            {
                                strRepeat = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "attachment":
                            if (mProperty.Groups["unit"].Value != "scroll")
                            {
                                strAttachment = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "position":
                            if (mProperty.Groups["unit"].Value != "0% 0%")
                            {
                                strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                    }
                }

                string strShortcut;
                string strProperties = string.Format("{0}{1}{2}{3}{4};", strColor, strImage, strRepeat, strAttachment, strPosition);
                strShortcut = string.Format("background:{0}", strProperties.Trim());
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandReplaceV2(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
            {
                // Replace method for regexes used in ShortHand property method for properties with top, right, bottom and left sub properties.
                string strTop, strRight, strBottom, strLeft;
                strTop = string.Empty;
                strRight = string.Empty;
                strBottom = string.Empty;
                strLeft = string.Empty;
                string strProperty;
                strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["position"].Value)
                    {
                        case "top":
                            strTop = mProperty.Groups["unit"].Value;
                            break;
                        case "right":
                            strRight = mProperty.Groups["unit"].Value;
                            break;
                        case "bottom":
                            strBottom = mProperty.Groups["unit"].Value;
                            break;
                        case "left":
                            strLeft = mProperty.Groups["unit"].Value;
                            break;
                    }

                }

                string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
                string strNewBlock = reTRBL1.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
            private static string ShortHandListReplaceV2(MatchCollection mcProperySet, Regex re, string InputText)
            {
                /*
                 * This Function replaces the individual list properties with a single entry
                 * */
                string strType, strPosition, strImage;
                strType = string.Empty;
                strPosition = string.Empty;
                strImage = string.Empty;
                foreach (Match mProperty in mcProperySet)
                {
                    switch (mProperty.Groups["style"].Value)
                    {
                        case "type":
                            if (mProperty.Groups["unit"].Value != "disc")
                            {
                                strType = mProperty.Groups["unit"].Value;
                            }
                            break;
                        case "position":
                            if (mProperty.Groups["unit"].Value != "outside")
                            {
                                strPosition = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                        case "image":
                            if (mProperty.Groups["unit"].Value != "none")
                            {
                                strImage = string.Format(" {0}", mProperty.Groups["unit"].Value);
                            }
                            break;
                    }

                }

                string strShortcut = string.Format("list-style:{0}{1}{2};", strType, strPosition, strImage);
                string strNewBlock = re.Replace(InputText, "");
                strNewBlock = strNewBlock.Insert(1, strShortcut);
                return strNewBlock;
            }
        }
    }
     

  • Follow up to Additional CSS minifying regex patterns

    OK, these regexes were discussed in the previous post; this is mostly just their application.

    This is a C# 2.0 enhancement of a C# port of YUI Compressor's CSS minification code.

    Since I was doing this in C# I took full advantage of its regex engine, namely using lookbehinds and delegates for some replaces.

    Almost all the regexes after the "New test" comment are new or modified regexes from the ported version. There is also one new and two modified expressions before that comment. One of those modifications is just a change in writing style; the other modifications replace some code, but (hopefully) not functionality, with a regex replace. The new regex replacements are, of course, the new compression enhancements.

    There are also a couple of new regexes not mentioned in the previous post that match some of the color values and replace them with an equivalent but more concisely written value. Replacing the color "red" is a straight replace, but the other colors require some code evaluation and use delegates.

    I've done some very limited testing, but as I mentioned in the previous post, most of the CSS I've written doesn't have some of the new things I was searching for. I could add them for a test (which I did), but that won't catch any problems they may cause in the actual CSS application, since I wasn't really using the test values. So the source code is now available for beta testing. Test early and often before committing to use it. I'm willing to fix any minor bugs for things I may have overlooked, but if a particular replace is problematic it's easy enough to comment out the offender and use the rest.

    And as was mentioned in the comments of the previous post, any generated content that looks like CSS may get stepped on, so be aware of that.

    Also, all licenses for previous versions still apply.

    UPDATE 2008-04-27

    After a little more testing I discovered that one of the replaces I was doing can alter how the CSS is processed, so I have crossed out the functions and the function call.
    I've come up with a safer replacement, though one that is less likely to occur.

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;
     namespace CSSMinify
    {
     class CSSMinify
     {
       public static Hashtable shortColorNames = new Hashtable();
       public static Hashtable shortHexColors = new Hashtable();
       public static string Minify(string css)
       {
         return Minify(css, 0);
       }
       public static string Minify(string css, int columnWidth)
       {
       // BSD License http://developer.yahoo.net/yui/license.txt
       // New css tests and regexes by Michael Ash
         createHashTable();
         MatchEvaluator rgbDelegate = new MatchEvaluator(RGBMatchHandler);
         MatchEvaluator shortColorNameDelegate = new MatchEvaluator(ShortColorNameMatchHandler);
         MatchEvaluator shortColorHexDelegate = new MatchEvaluator(ShortColorHexMatchHandler);
         css = RemoveCommentBlocks(css);
         css = Regex.Replace(css, @"\s+", " "); //Normalize whitespace
         css = Regex.Replace(css, @"\x22\x5C\x22}\x5C\x22\x22", "___PSEUDOCLASSBMH___"); //hide Box model hack
         /* Remove the spaces before the things that should not have spaces before them.
             But, be careful not to turn "p :link {...}" into "p:link{...}"
         */
         css = Regex.Replace(css, @"(?#no preceding space needed)\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
         css = Regex.Replace(css, @"([!{}:;>+([,])\s+", "$1"); // Remove the spaces after the things that should not have spaces after them.
         css = Regex.Replace(css, @"([^;}])}", "$1;}"); // Add the semicolon where it's missing.
         css = Regex.Replace(css, @"(\d+)\.0+(p(?:[xct])|(?:[cem])m|%|in|ex)\b", "$1$2"); // Remove .0 from size units x.0em becomes xem
         css = Regex.Replace(css, @"([\s:])(0)(px|em|%|in|cm|mm|pc|pt|ex)\b", "$1$2"); // Remove unit from zero
         //New test
         css = ShortHandProperty(css);
         //css = Regex.Replace(css, @":(\s*0){2,4}\s*;", ":0;"); // if all parameters zero just use 1 parameter
         // if all 4 parameters the same unit make 1 parameter
         css = Regex.Replace(css, @":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};", ":$1;", RegexOptions.IgnoreCase);
         // if has 4 parameters and top unit = bottom unit and right unit = left unit make 2 parameters
         css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;", ":$1;", RegexOptions.IgnoreCase);
         // if has 4 parameters and top unit != bottom unit and right unit = left unit make 3 parameters
         css = Regex.Replace(css, @":\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
         //// if has 3 parameters and top unit = bottom unit make 2 parameters
         //css = Regex.Replace(css, @":\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;", ":$1;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css,"background-position:0;", "background-position:0 0;");
         css = Regex.Replace(css,@"(:|\s)0+\.(\d+)", "$1.$2");
       // Outline-style and border-style parameter reduction
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);", "$1-style:$2;", RegexOptions.IgnoreCase);
         // Outline-color and Border-color parameter reduction
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};", "$1-color:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);", "$1-color:$2;", RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);", "$1-color:$2;", RegexOptions.IgnoreCase);
     // Shorten colors from rgb(51,102,153) to #336699
     // This makes it more likely that it'll get further compressed in the next step.
         css = Regex.Replace(css,@"rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29", rgbDelegate);
         css = Regex.Replace(css, @"(?<![\x22\x27=]\s*)\#(?:([0-9A-F])\1)(?:([0-9A-F])\2)(?:([0-9A-F])\3)", "#$1$2$3", RegexOptions.IgnoreCase);
     // Replace hex color code with the named value if the name is shorter
         css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>f00)\b", "red",RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(?<=color\s*:\s*.*)\#(?<hex>[0-9a-f]{6})", shortColorNameDelegate, RegexOptions.IgnoreCase);
         css = Regex.Replace(css, @"(?<=color\s*:\s*)\b(Black|Fuchsia|LightSlateGr[ae]y|Magenta|White|Yellow)\b", shortColorHexDelegate,RegexOptions.IgnoreCase);
         // Remove empty rules.
         css = Regex.Replace(css,@"[^}]+{;}", "");
         //Remove semicolon of last property
         css = Regex.Replace(css, ";(})", "$1");
         if (columnWidth > 0)
         {
           css = BreakLines(css, columnWidth);
         }
         return css;
     }
     private static string RemoveCommentBlocks(string input)
     {
       int startIndex = 0;
       int endIndex = 0;
       bool iemac = false;
       startIndex = input.IndexOf(@"/*", startIndex);
       while (startIndex >= 0)
       {
         endIndex = input.IndexOf(@"*/", startIndex + 2);
         if (endIndex >= startIndex + 2)
         {
           if (input[endIndex - 1] == '\\')
           {
             startIndex = endIndex + 2;
             iemac = true;
           }
           else if (iemac)
           {
             startIndex = endIndex + 2;
             iemac = false;
            }
           else
           {
            input = input.Remove(startIndex, endIndex + 2 - startIndex);
            }
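         }
         else
         {
           // No comment terminator was found: stop scanning instead of looping forever on the same index.
           break;
         }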
         startIndex = input.IndexOf(@"/*", startIndex);
       }
     return input;
    }
     private static String RGBMatchHandler(Match m)
     {
       int val = 0;
       StringBuilder hexcolor = new StringBuilder("#");
       for(int index=1; index <= 3; index += 1)
       {
         val = Int32.Parse(m.Groups[index].Value);
         hexcolor.Append(val.ToString("x2"));
       }
       return hexcolor.ToString();
     }
     private static string BreakLines(string css, int columnWidth)
    {
       int i = 0;
       int start = 0;
       StringBuilder sb = new StringBuilder(css);
       while (i < sb.Length)
       {
         char c = sb[i++];
         if (c == '}' && i - start > columnWidth)
         {
           sb.Insert(i, '\n');
           start = i;
         }
       }
     return sb.ToString();
     }
     private static string ShortHandProperty(string css)
     {
     /*
      * This function searches for properties specifying at least 2 of the top, right, bottom or left box model
      * positions and reduces them to a single property using shorthand notation
      */
       Regex reCSSBlock = new Regex("{[^{}]*}");
       Regex reTRBL1 = new Regex(@"(?<fullProperty>(?:(?<property>padding)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL2 = new Regex(@"(?<fullProperty>(?:(?<property>margin)-(?<position>top|right|bottom|left)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL3 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:color)))\s*:\s*(?<unit>[#\w.]+);?", RegexOptions.IgnoreCase);
       Regex reTRBL4 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:style)))\s*:\s*(?<unit>none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset);?",  RegexOptions.IgnoreCase);
       Regex reTRBL5 = new Regex(@"(?<fullProperty>(?<property>border)-(?<position>top|right|bottom|left)(?<property2>-(?:width)))\s*:\s*(?<unit>[\w.]+);?", RegexOptions.IgnoreCase);
       MatchCollection mcBlocks = reCSSBlock.Matches(css);
       foreach (Match mBlock in mcBlocks)
       {
         string strBlock= mBlock.Value;
         MatchCollection mcProperySet = reTRBL1.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL1, strBlock);
     
         }
         mcProperySet = reTRBL2.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL2, strBlock);
         }
         mcProperySet = reTRBL3.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL3, strBlock);
         }
         mcProperySet = reTRBL4.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL4, strBlock);
         }
         mcProperySet = reTRBL5.Matches(strBlock);
         if (mcProperySet.Count > 1)
         {
           strBlock = ShortHandReplace(mcProperySet, reTRBL5, strBlock);
         }
         css = css.Replace(mBlock.Value, strBlock);
       }
       return css;
     }
     private static string ShortHandReplace(MatchCollection mcProperySet, Regex reTRBL1, string InputText)
     {
     // Replace method for regexes used in ShortHand property method.
       string strTop, strRight, strBottom, strLeft;
       strTop = string.Empty;
       strRight = string.Empty;
       strBottom = string.Empty;
       strLeft = string.Empty;
       string strProperty;
       string strDefaultValue;
     
       strProperty = string.Format("{0}{1}", mcProperySet[0].Groups["property"].Value, mcProperySet[0].Groups["property2"].Value);
       switch (strProperty){
         case "border-color":
           strDefaultValue = "inherit";
           break;
         case "border-style":
           strDefaultValue = "none";
           break;
         default:
           strDefaultValue = "0";
           break;
       }
       foreach (Match mProperty in mcProperySet)
       {
         if (mProperty.Groups["position"].Value == "top")
         {
           if (strTop == string.Empty)
           {
             strTop = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
         if (mProperty.Groups["position"].Value == "right")
         {
           if (strRight == string.Empty)
           {
             strRight = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
          }
         if (mProperty.Groups["position"].Value == "bottom")
         {
           if (strBottom == string.Empty)
           {
             strBottom = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
         if (mProperty.Groups["position"].Value == "left")
         {
           if (strLeft == string.Empty)
           {
             strLeft = mProperty.Groups["unit"].Value;
           }
           else
           {
             break;
           }
         }
       }
       if (strTop == string.Empty)
       {
         strTop = strDefaultValue;
       }
       if (strRight == string.Empty)
       {
         strRight = strDefaultValue;
       }
       if (strBottom == string.Empty)
       {
         strBottom = strDefaultValue;
       }
       if (strLeft == string.Empty)
       {
         strLeft = strDefaultValue;
       }
       string strShortcut = string.Format("{0}:{1} {2} {3} {4};", strProperty, strTop, strRight, strBottom, strLeft);
       string strNewBlock = reTRBL1.Replace(InputText, "");
       strNewBlock = strNewBlock.Insert(1, strShortcut);
       return strNewBlock;
     }

     private static string ShortColorNameMatchHandler(Match m)
     {
     // This function replaces hex color values with named colors if the name is shorter than the hex code
       string returnValue = m.Value;
       // The table keys are lower case but the match is case-insensitive, so normalize before the lookup.
       string hex = m.Groups["hex"].Value.ToLower();
       if (shortColorNames.ContainsKey(hex))
       {
         returnValue = shortColorNames[hex].ToString();
       }
       return returnValue;
     }
     private static string ShortColorHexMatchHandler(Match m)
     {
       return shortHexColors[m.Value.ToString().ToLower()].ToString();
     }
     private static void createHashTable()
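     {
       // Minify calls this on every run and the tables are static, so fill them only once
       // (a second Add of the same key would throw).
       if (shortColorNames.Count > 0) return;
     //Color names shorter than hex notation. Except for red.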
       shortColorNames.Add("F0FFFF".ToLower(), "Azure".ToLower());
       shortColorNames.Add("F5F5DC".ToLower(), "Beige".ToLower());
       shortColorNames.Add("FFE4C4".ToLower(), "Bisque".ToLower());
       shortColorNames.Add("A52A2A".ToLower(), "Brown".ToLower());
       shortColorNames.Add("FF7F50".ToLower(), "Coral".ToLower());
       shortColorNames.Add("FFD700".ToLower(), "Gold".ToLower());
       shortColorNames.Add("808080".ToLower(), "Grey".ToLower());
       shortColorNames.Add("008000".ToLower(), "Green".ToLower());
       shortColorNames.Add("4B0082".ToLower(), "Indigo".ToLower());
       shortColorNames.Add("FFFFF0".ToLower(), "Ivory".ToLower());
       shortColorNames.Add("F0E68C".ToLower(), "Khaki".ToLower());
       shortColorNames.Add("FAF0E6".ToLower(), "Linen".ToLower());
       shortColorNames.Add("800000".ToLower(), "Maroon".ToLower());
       shortColorNames.Add("000080".ToLower(), "Navy".ToLower());
       shortColorNames.Add("808000".ToLower(), "Olive".ToLower());
       shortColorNames.Add("FFA500".ToLower(), "Orange".ToLower());
       shortColorNames.Add("DA70D6".ToLower(), "Orchid".ToLower());
       shortColorNames.Add("CD853F".ToLower(), "Peru".ToLower());
       shortColorNames.Add("FFC0CB".ToLower(), "Pink".ToLower());
       shortColorNames.Add("DDA0DD".ToLower(), "Plum".ToLower());
       shortColorNames.Add("800080".ToLower(), "Purple".ToLower());
       shortColorNames.Add("FA8072".ToLower(), "Salmon".ToLower());
       shortColorNames.Add("A0522D".ToLower(), "Sienna".ToLower());
       shortColorNames.Add("C0C0C0".ToLower(), "Silver".ToLower());
       shortColorNames.Add("FFFAFA".ToLower(), "Snow".ToLower());
       shortColorNames.Add("D2B48C".ToLower(), "Tan".ToLower());
       shortColorNames.Add("008080".ToLower(), "Teal".ToLower());
       shortColorNames.Add("FF6347".ToLower(), "Tomato".ToLower());
       shortColorNames.Add("EE82EE".ToLower(), "Violet".ToLower());
       shortColorNames.Add("F5DEB3".ToLower(), "Wheat".ToLower());
       // Hex notation shorter than named value
       shortHexColors.Add("black", "#000");
       shortHexColors.Add("fuchsia", "#f0f");
       shortHexColors.Add("lightslategray", "#789");
       shortHexColors.Add("lightslategrey", "#789");
       shortHexColors.Add("magenta", "#f0f");
       shortHexColors.Add("white", "#fff");
       shortHexColors.Add("yellow", "#ff0");
       }
     }
    }
     

  • Additional CSS minifying regex patterns

    NOTE: All the regexes on this page that I wrote use IgnoreCase = true.

    I was looking at the regexes used in the YUI Compressor to minify CSS and came up with a couple more that I think could help the process. The code and port I was looking at were already trimming unneeded zeros used for the top, right, bottom, left values with a simple string replace. But there were three separate replaces being done. It was pretty simple to come up with a regex to handle all the cases.

    (Pseudo code)


    string.Replace(":0 0 0 0;", ":0;")
    string.Replace(":0 0 0;", ":0;")
    string.Replace(":0 0;", ":0;")

    becomes

    Regex.Replace(input, @":(\s*0)(\s+0){0,3}\s*;", ":0;")
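
    As a rough illustration of that single replace, here is a minimal C# sketch. The class and method names (ZeroCollapseDemo, Collapse) are made up for this example; only the pattern and the ":0;" replacement come from the text above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZeroCollapseDemo
{
    // Hypothetical helper: collapses a run of one to four zeros after a colon into ":0;".
    public static string Collapse(string css)
    {
        return Regex.Replace(css, @":(\s*0)(\s+0){0,3}\s*;", ":0;");
    }
}
```

    Calling ZeroCollapseDemo.Collapse("a{margin:0 0 0 0;padding:0 0;}") should give "a{margin:0;padding:0;}", while a mixed value such as "margin:0 10px;" is left alone because the second value is not a bare zero.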

    Pretty simple, but of course I thought, why stop there? So I came up with a regex for all numbers:

    :\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};

    The replacement string is simply ":$1;"
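
    A small sketch of how this pattern might be applied in C#, assuming the IgnoreCase option mentioned in the note at the top. SameValueDemo and Collapse are illustrative names, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SameValueDemo
{
    // Hypothetical helper: "10px 10px 10px 10px" (or 2 or 3 repeats) becomes "10px".
    public static string Collapse(string css)
    {
        return Regex.Replace(
            css,
            @":\s*(0|(?:(?:\d*\.?\d+(?:p(?:[xct])|(?:[cem])m|%|in|ex))))(\s+\1){1,3};",
            ":$1;",
            RegexOptions.IgnoreCase);
    }
}
```

    The backreference \1 only succeeds when every following value is textually identical to the first, so "padding:1em 2em;" is not touched.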

    Once this was done, next on the list was to handle cases where the numbers are not all the same.


    For those who don't know CSS, this part of the syntax basically says that, of the 4 possible values:

    1) if only one value is specified, the other 3 are implied to be the same value (X = X X X X)
    2) if two values are specified, the 1st implies the 3rd and the 2nd implies the 4th (X Y = X Y X Y)
    3) if three values are specified, the 2nd implies the 4th (X Y Z = X Y Z Y)

    Of course when minifying you want to use the shorter syntax, so the following regexes convert the longer to the shorter.
    The replacement string for all is the same as above, ":$1;"


    # 4 parameters to 2 x y x y to x y

    :\s*((0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2\s+\3;


    # 4 to 3 (x y z y to x y z) or 3 to 2 (x y x to x y)

    :\s*((?:(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+)?(0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex))))\s+(?:0|(?:(?:\d?\.?\d(?:p(?:[xct])|(?:[cem])m|%|in|ex)))))\s+\2;

    Though they may look unwieldy, the longer ones just repeat one of the sub-patterns. The only real difference is what is and isn't captured.
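
    To make the repetition visible, here is a sketch that builds both patterns from one shared sub-pattern. This is my own restructuring for illustration, not how the patterns were originally written: Num, FourToTwo, DropLast and Shorten are hypothetical names, and Num spells the unit part slightly more compactly (p[xct], [cem]m) while matching the same text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoxShorthandDemo
{
    // One box value: a bare 0, or a short number followed by a unit (non-capturing).
    private const string Num = @"(?:0|(?:\d?\.?\d(?:p[xct]|[cem]m|%|in|ex)))";

    // x y x y -> x y  (group 2 = x, group 3 = y)
    private static readonly Regex FourToTwo = new Regex(
        @":\s*((" + Num + @")\s+(" + Num + @"))\s+\2\s+\3;",
        RegexOptions.IgnoreCase);

    // x y z y -> x y z, or x y x -> x y  (group 2 = the value repeated at the end)
    private static readonly Regex DropLast = new Regex(
        @":\s*((?:" + Num + @"\s+)?(" + Num + @")\s+" + Num + @")\s+\2;",
        RegexOptions.IgnoreCase);

    public static string Shorten(string css)
    {
        css = FourToTwo.Replace(css, ":$1;");
        return DropLast.Replace(css, ":$1;");
    }
}
```

    With this sketch, "margin:1px 2px 1px 2px;" should come out as "margin:1px 2px;", "margin:1px 2px 3px 2px;" as "margin:1px 2px 3px;", and "margin:1px 2px 1px;" as "margin:1px 2px;". Four different values are left unchanged.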

    Along those same lines I came up with similar patterns for border-style, outline-style, border-color and outline-color

    border-style/outline-style

    The replacement string for all is “$1-style:$2;”


    (outline|border)-style\s*:\s*(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)(?:\s+\2){1,3};


    (outline|border)-style\s*:\s*((none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3)(?:\s+\4);


    (outline|border)-style\s*:\s*((?:(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+)?(none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)\s+(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset))(?:\s+\3);
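
    A sketch of the first two style patterns in use, with the keyword list factored into one constant (written with no space after "outset"). StyleShorthandDemo, Style, AllSame, Pairs and Shorten are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleShorthandDemo
{
    // The border/outline style keywords, as a non-capturing group.
    private const string Style =
        @"(?:none|hidden|d(?:otted|ashed|ouble)|solid|groove|ridge|inset|outset)";

    // s s s s (or s s, s s s) -> s
    private static readonly Regex AllSame = new Regex(
        @"(outline|border)-style\s*:\s*(" + Style + @")(?:\s+\2){1,3};",
        RegexOptions.IgnoreCase);

    // s t s t -> s t
    private static readonly Regex Pairs = new Regex(
        @"(outline|border)-style\s*:\s*((" + Style + @")\s+(" + Style + @"))(?:\s+\3)(?:\s+\4);",
        RegexOptions.IgnoreCase);

    public static string Shorten(string css)
    {
        css = AllSame.Replace(css, "$1-style:$2;");
        return Pairs.Replace(css, "$1-style:$2;");
    }
}
```

    For instance, "border-style:solid solid solid solid;" should reduce to "border-style:solid;" and "border-style:dotted dashed dotted dashed;" to "border-style:dotted dashed;".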


    border-color/outline-color

    The replacement string for all is “$1-color:$2;”


    (outline|border)-color\s*:\s*((?:\#(?:[0-9A-F]{3}){1,2})|\S+)(?:\s+\2){1,3};


    (outline|border)-color\s*:\s*(((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+((?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3)(?:\s+\4);


    (outline|border)-color\s*:\s*((?:(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+)?((?:\#(?:[0-9A-F]{3}){1,2})|\S+)\s+(?:(?:\#(?:[0-9A-F]{3}){1,2})|\S+))(?:\s+\3);


    I also came up with a couple more regexes to replace some code, but as a couple of these use lookbehinds they are not as portable.

    These work in .NET, which supports variable-length lookbehinds.

    This pattern

    \s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))

    is used to find a character preceded by unneeded whitespace, taking care not to find colons that are used in pseudo-selectors or pseudo-classes. The replacement string would be $1.
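
    Here is a short C# sketch of that replace on its own, so the effect of the lookbehind on colons is easier to see. SpaceBeforeDemo and Strip are hypothetical names; the input is assumed to have had its whitespace normalized to single spaces already.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpaceBeforeDemo
{
    // Hypothetical helper: drops whitespace before ! { } ; > + ( ) ] , and before
    // a colon only when that colon sits inside a { ... } block.
    public static string Strip(string css)
    {
        return Regex.Replace(css, @"\s+((?:[!{};>+()\],])|(?<={[^{}]*):(?=[^}]*}))", "$1");
    }
}
```

    Given "p :link { color : red ; }", this should return "p :link{ color: red;}". The space before ":link" survives because no opening brace precedes it, while the space before the colon after "color" is removed. The spaces that follow "{" and ":" are left for a separate replace.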

    This pattern

    (?<![\x22\x27=]\s*)\#([0-9A-F])\1([0-9A-F])\2([0-9A-F])\3

    is used for reducing hexadecimal values from #AABBCC to #ABC; since the pattern also consumes the #, the replacement string would be #$1$2$3.
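
    A minimal sketch of this replace, assuming IgnoreCase and a replacement string that puts the # back in front of the three captured digits. HexShortenDemo and Shorten are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexShortenDemo
{
    // Hypothetical helper: #AABBCC -> #ABC, unless the # follows a quote or an equals sign.
    public static string Shorten(string css)
    {
        return Regex.Replace(
            css,
            @"(?<![\x22\x27=]\s*)\#([0-9A-F])\1([0-9A-F])\2([0-9A-F])\3",
            "#$1$2$3",
            RegexOptions.IgnoreCase);
    }
}
```

    In "a[title='#FFFFFF']{color:#FFFFFF;}" only the second value should be shortened, because the first # comes right after a single quote and the negative lookbehind rejects it.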

    The pattern I created was to find RGB values of the format rgb(x, y, z), where x, y and z are integers in the range 0-255.

    rgb\s*\x28((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*,\s*((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))\s*\x29

    The pattern I used is just stricter than the existing one and puts each value in its own group so I can work with them in further processing. The match itself is passed to a MatchEvaluator to do the decimal-to-hex conversion.
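
    The following is a sketch of that idea, not the exact production code: the 0-255 sub-pattern is stored once in a constant I call Channel, and a MatchEvaluator turns the three captured numbers into two-digit hex. RgbToHexDemo, Channel, ToHex and Rewrite are names chosen for this example, and IgnoreCase is assumed.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class RgbToHexDemo
{
    // One color channel, 0-255, captured as its own group.
    private const string Channel = @"((?:25[0-5])|(?:2[0-4]\d)|(?:[01]?\d?\d))";

    private static readonly Regex RgbPattern = new Regex(
        @"rgb\s*\x28" + Channel + @"\s*,\s*" + Channel + @"\s*,\s*" + Channel + @"\s*\x29",
        RegexOptions.IgnoreCase);

    // Called once per match; groups 1 to 3 hold the decimal channel values.
    private static string ToHex(Match m)
    {
        StringBuilder hex = new StringBuilder("#");
        for (int i = 1; i <= 3; i++)
        {
            hex.Append(int.Parse(m.Groups[i].Value).ToString("x2"));
        }
        return hex.ToString();
    }

    public static string Rewrite(string css)
    {
        return RgbPattern.Replace(css, new MatchEvaluator(ToHex));
    }
}
```

    With this sketch, "color:rgb(51, 102, 153);" should become "color:#336699;", and an out-of-range value such as rgb(256,0,0) is not matched at all, so it passes through untouched.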

    I did a little alpha testing, but most of the CSS I've written doesn't have the stuff I'm testing for, so beta testing is definitely in order. Also, I don't really know CSS hacks, so I don't know if any of this will have an adverse effect on hacks. Other than the top/left/bottom/right parameters I was mostly trying to duplicate the existing effects. I'm going to pass this info on to the YUI Compressor maintainer to see if they can make use of any of this.

  • A touch of Character Class

    The square-bracket character class is one of the more misunderstood of the basic regex features. This feature is supported in virtually all regex implementations. In fact, off the top of my head I don't know of an implementation that doesn't support it. Maybe it's not well documented in most tutorials, or maybe the samples are not clear enough, or maybe users are just skimming over the details for this one, but I see this feature misused quite often.

    This type of character class simply matches one and only one of the characters between the square brackets. That's pretty much it. Now there are a few caveats about what can go between the brackets, but the output is the same. One, and only one, character will be returned if there is a match.

    A common error rookie regex users tend to make is trying to write patterns inside the character class to make it match a particular sequence of characters: either a particular fixed string of characters or a more generic regex pattern. One of the obvious tells of this approach is when the same character appears in the character class multiple times, something like [RADAR] or [HELLO]. Listen up newbies, you can't write any sort of pattern within a character class, so don't waste your time trying. If you have a fixed string like “ABC”, don't try using the character class [ABC] to match that string as a whole. As I stated in the previous paragraph, this feature is for matching a character (singular) from a group of characters, not an ordering of characters (plural), so you don't get one match of “ABC”; what you get is 3 matches, since each character in the string is one of the 3 characters. There is nothing you can do inside of the brackets to make it return “ABC” as a single match. You can't group these characters together since you can't write any patterns within the character class. All the characters in the class are tested against just one position in your input. Now you can add a quantifier, like + or *, after the character class to make “ABC” one match, but remember what the character class matches: one character. The quantifier allows the whole character class to be applied multiple times. None of the possible choices are eliminated in following applications of the class. Every application of [ABC] tests for 1 of 3 possible characters; [ABC]+ tests for 1 of 3 possible characters one or more times. So while it will match “ABC”, it doesn't only match “ABC”, which brings me to my second topic, order.
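    A quick way to convince yourself of this is to try it. The snippet below uses Python's re module, which is simply my choice for illustration; any engine should behave the same way for these patterns.

```python
import re

# [ABC] tests one position at a time, so "ABC" gives three one-character matches.
single = re.findall(r"[ABC]", "ABC")

# Repeating a letter inside the class changes nothing: [RADAR] is just [RAD].
same_class = re.findall(r"[RADAR]", "RADAR") == re.findall(r"[RAD]", "RADAR")

# With a quantifier the whole class is applied again at every position,
# so [ABC]+ accepts "ABC" but also any other mix of those three letters.
runs = [bool(re.fullmatch(r"[ABC]+", s)) for s in ("ABC", "CAB", "AAAA", "ABD")]

print(single)      # ['A', 'B', 'C']
print(same_class)  # True
print(runs)        # [True, True, True, False]
```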

    Along with the incorrect belief that you can include a pattern within a character class goes the belief that the characters within the class have to be in the same order as the characters being tested against. This isn't true. Everything I said about the regex pattern [ABC] is true for the patterns [CBA], [CAB] and [BCA]. They all match exactly the same thing. With a few exceptions, a couple of which have to do with ranges, one of which I've discussed before, the order of the characters doesn't matter. If you are using a range, the value on the left of the hyphen has to be a lower ASCII value (or code point for Unicode) than the value on the right. Of the previous equivalent character class patterns, the only way to write an equivalent range is [A-C]. [C-A] won't work and in some implementations will throw a compile error.
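    Here is a small check of both claims, again in Python as an assumed test bed. The helper name compiles is made up for the example; Python happens to be one of the implementations that rejects a reversed range at compile time.

```python
import re

text = "CABBAGE"
orders = ["[ABC]", "[CBA]", "[CAB]", "[BCA]", "[A-C]"]

# Every ordering of the class, and the equivalent range, finds the same characters.
results = [re.findall(pattern, text) for pattern in orders]
all_same = all(result == results[0] for result in results)

def compiles(pattern):
    """Hypothetical helper: report whether the engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(all_same)           # True
print(compiles("[A-C]"))  # True
print(compiles("[C-A]"))  # False
```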

    Now these first couple of errors may happen because new users think that alternation and character classes are interchangeable. They are not. Some tutorials may present that as the case, especially if the examples for each are too simple. Any character class pattern can be written using alternation; however, the reverse isn't always true. Alternation can include single characters, literal strings or general regex patterns. The character class pattern [ABC] can be written using alternation as A|B|C. The alternation of single characters C|D|E can be written [CDE] or [DCE], but the alternation of multiple-character patterns (ABC)|(DEF)|(GHI) cannot be written as an equivalent pattern using just a character class. Though you could write a pattern that would include matches of the same 3 values, you couldn't limit it to matching just those 3 values. Even a negated character class can be written using alternation, but that would generally lead to enormous patterns, making it impractical to write a pattern in that fashion. Those trying to write patterns inside of a character class need to turn their attention to alternation and/or one or more grouping constructs.
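    To make the difference concrete, here is a short Python sketch (my choice of language, and [A-I]{3} is just one class-based attempt I picked for illustration). The class version accepts the three wanted strings but cannot be limited to them.

```python
import re

# A class and an alternation of single characters accept exactly the same input.
cls = re.compile(r"[CDE]")
alt = re.compile(r"C|D|E")
same = all(bool(cls.fullmatch(ch)) == bool(alt.fullmatch(ch)) for ch in "ABCDEFG")

# Alternation of whole strings has no character class equivalent.
words = re.compile(r"(ABC)|(DEF)|(GHI)")
loose = re.compile(r"[A-I]{3}")  # matches the three words, and much more

print(same)                                 # True
print(bool(words.fullmatch("DEF")))         # True
print(bool(words.fullmatch("AEI")))         # False
print(bool(loose.fullmatch("AEI")))         # True
```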

    The final error I see a lot is one that gets regex rookies and vets alike: regex metacharacters. With rookies this mostly falls under trying to use a pattern within the character class, by trying to use quantifiers. Another is using the dot character, which in this situation matches only itself. Within a character class, in most implementations the only metacharacter that still retains its special regex meaning is \ , which means it can be used as it normally is. The shorthand character class notations will work the same, except for \b, which changes its meaning, which makes sense if you think about it. You can't have any patterns within a character class, and \b outside of a character class doesn't match a character and would only be used as part of a pattern.

    However, one metacharacter, ^, alters its behavior, but only if it is the first character inside the brackets. This is the other position-dependent case I was referring to earlier. All the other regular metacharacters that have a special meaning outside a character class lose their special meaning within one and are interpreted as literals, so there is no need to escape these characters, though doing so isn't wrong, just unnecessary. I talked about the hyphen, which gains a feature, a few years ago.
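    The following Python lines illustrate these rules. Python is an assumption on my part, and the last line relies on Python's documented behavior that \b inside a class means the backspace character; other implementations may treat it differently.

```python
import re

dot_outside = re.findall(r".", "a.b")          # the dot matches any character
dot_inside = re.findall(r"[.]", "a.b")         # inside a class it is a literal dot
quantifiers = re.findall(r"[+*?]", "a+b*c?")   # quantifier characters are literals too
negated = re.findall(r"[^abc]", "abcd")        # a leading ^ negates the class
literal_caret = re.findall(r"[a^]", "a^b")     # anywhere else ^ is a literal
digits = re.findall(r"[\d]", "a1b2")           # shorthand classes still work
backspace = re.findall(r"[\b]", "a\bb")        # \b is a backspace here, not a word boundary

print(dot_outside, dot_inside, quantifiers)
print(negated, literal_caret, digits, backspace)
```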

    Finally, one last thing I see rookie regex users doing isn't really an error, but it makes it look like they aren't sure when to use a character class, and that's when they have a positive character class with only one character in it, like [k]. There is no reason to do this. Just type the character. If you were negating a single character then that makes sense, but otherwise it's just extra typing.


    The character class is great when you are trying to match (or negate) a single character at one position in your input and you have lots of possible characters to choose from, but anything else is outside of its domain.






  • Remember where you come from

    One thing I've noticed among rookie regex users is that they focus way too much on what they are trying to match and not on where they are trying to match it from. There is a tendency to drastically underestimate the importance of the source data they are trying to apply their regex against.

    I see this all the time on forums, where posters asking for help almost never post any sample data, and most of the few that do post a sample they made up on the spot. Over on the RegexAdvice construction forum we've added guidelines for posting a question. The whole point of the guidelines is to expedite getting questions answered. One of the things we ask is that posters provide a sample of the actual data they are trying to match. We explicitly ask that they don't make up a sample. Unfortunately most people don't follow this request, along with others in the guidelines, which generally results in those questions taking longer to get answered. On this particular point, not supplying any sample is the worst offense, because anyone trying to help you can't see what you are working with and can only give an untested guess at a solution. It's like calling a mechanic on the phone and saying “My car isn't running right”, leaving your number and waiting for him to call back to tell you how to fix it. He can call back and tell you to change your oil, but without more detail he can't know that will fix your problem.

    However, making up a sample doesn't help much either and is almost as bad. I know people think they are being helpful and not boring people with the bloody details by providing a simple example to work with, but generally it's not helpful since it doesn't represent their real problem. In regex construction details matter. Often the made-up sample is too simple. So what they end up getting is a solution that works great for their sample but doesn't handle their real problem well, if at all. So now you are back where you started. This is like calling your mechanic on the phone and saying “My car is making a funny noise.” While it's less vague than “isn't running right”, it's still vague. You may know exactly what kind of noise it is, how loud it is and where it's coming from, but that sentence doesn't communicate any of that. Made-up samples are about the same when asking for regex help.

    The majority of regular expression problems are context sensitive, meaning the approach to matching the target string may depend on where it is in the source string. Now sometimes the target string is so distinct from the surrounding text that the context seems irrelevant. Say you are trying to find a string of digits somewhere in your source data, and there just happens to be only one occurrence of digits in your source. Well, in that case a generic regex that finds digits will be good enough.


    Source : “abc345def” or “a125bcd”


    Regex: \d+


    Now I said the context seems irrelevant in this case, but it is as relevant as ever. The context is that digits only occur once in the string.
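    A tiny Python check of that point (Python and the helper name find_digits are my own choices for illustration). The third input is a made-up string added to show what happens once digits occur more than once.

```python
import re

def find_digits(source):
    """Hypothetical helper: return every run of digits in the source."""
    return re.findall(r"\d+", source)

print(find_digits("abc345def"))         # ['345']
print(find_digits("a125bcd"))           # ['125']
print(find_digits("room 12, floor 3"))  # ['12', '3']
```

    With the two sources from the post the generic regex returns exactly the wanted value; with the third it returns two candidates and something else has to decide which one is the target.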


    But the relevance becomes clearer if you have a source file where there are two or more occurrences of text that mimic your target data but only one is actually the data you want. For example, let's say you want to find the area code of a phone number. Well, if you agree that an area code is a 3-digit number, the regex \d{3} might be all you need. It depends. On what, you ask? Come on, you know, say it with me... “Your source data!”


    If your source data is like the case I mentioned first, where the only digits in the file are area codes, or where any other numbers in the file are shorter than three consecutive digits, then the simple regex should work.


    Sample


    1) AL 205, 256, 334

    2) AK 907

    3) AR 479

    4) etc.


    But if your source is a bunch of phone numbers


    1) Mom (555) 123-4567

    2) Sis 1-555-234-4567

    3) Friend 5553456789


    Well, now you have a problem. While your general task is still the same, you have to approach it differently because of the source. You can't simply look for 3 consecutive digits since that would return 3 matches for each line. But if we know how the data is structured, we can focus on how the data we want sticks out from the rest. In the above sample text the area code meets one of three conditions:


    1. 3 consecutive digits inside of parentheses

    2. first group of 3 consecutive digits in a string of 1-xxx-xxx-xxxx format

    3. first 3 consecutive digits of a ten consecutive digit string


    Now you can still match what you need in one regex, but how you choose to approach it changes. Some regex authors will use look-arounds to examine the surrounding text; others will match the surrounding text and use groups to pull out the desired data. Either way, you have to expand your field of vision beyond what you are trying to match to address the problem.
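    Here is one possible sketch of the "match the surrounding text and use groups" approach for the three conditions listed above. It is written in Python, the names AREA_CODE and area_code are hypothetical, and the pattern is only tuned to the three sample lines, not to phone numbers in general.

```python
import re

AREA_CODE = re.compile(
    r"\((\d{3})\)"                 # 1. three digits inside parentheses
    r"|\b1-(\d{3})-\d{3}-\d{4}\b"  # 2. first group of a 1-xxx-xxx-xxxx number
    r"|\b(\d{3})\d{7}\b"           # 3. first three digits of a ten-digit run
)

def area_code(line):
    """Hypothetical helper: return the area code found in the line, or None."""
    match = AREA_CODE.search(line)
    if match is None:
        return None
    # Only one of the three groups takes part in any given match.
    return next(group for group in match.groups() if group is not None)

lines = ["Mom (555) 123-4567", "Sis 1-555-234-4567", "Friend 5553456789"]
naive = [len(re.findall(r"\d{3}", line)) for line in lines]
codes = [area_code(line) for line in lines]

print(naive)  # [3, 3, 3]
print(codes)  # ['555', '555', '555']
```

    The naive \d{3} finds three candidates on every line, while the context-aware pattern returns one area code per line.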

    But even in this simple example, notice how knowing what the source data looks like is still important. Imagine someone asked you for help and you are trying to solve this problem without seeing the actual source above. Say someone posts (or asks) the question “I need a regex to match a US area code”.

    With no sample: Any helpful regex author has to guess how your data is formatted. Some may assume it's a string of “(xxx) yyy-yyyy”, others may assume it's “xxx-yyy-yyyy”, some may even think it's “State: xxx, xxx, xxx”, and none of them might even consider “1-xxx-yyy-yyyy” or “xxxyyyyyyy”.

     

    With made-up sample: “I have a string (555) 555-5555 and I need a regex to match a US area code”

    Now a regex author may assume all your phone numbers are in that nice format and that there are no variations. You'll most likely get a regex that matches that format perfectly but fails on the other formats you didn't think were important enough to mention.


    With real data: Regex authors don't have to guess. They can look at your data, analyze it themselves, and even test edge cases and address issues you may have overlooked.


    I hope this simple example shows how the same problem can have different approaches, all because of the source. Data in the wild is going to be much more complex than this, which will only magnify the issue.


    Now sometimes you can't provide real data because it's either too much or private. With a large chunk of data, simply clip a portion containing what you want to match. With private data, making up a sample is almost excusable, but don't make up simple ones. Garble the real data but preserve the format: change names to literary or TV characters, change addresses to the address of some public office (city hall, post office) or landmark. This is essentially a made-up sample, but if you substitute values your samples won't be as bland and will be closer to your real data and edge cases.


    Understanding your source data is crucial to your regex construction whether you are writing it yourself or asking for help.

  • Are you ready for regex?

    Who should be using regular expressions?

    This has been on my mind for a while: there are some people who shouldn't be using regular expressions. People who don't know what regular expressions are for. :^) Now you can use regexes in a variety of ways, and you can debate either side on whether certain applications are good uses or bad uses, but that's not what I'm talking about. I'm not even talking about people who don't know how to write regexes well, or at all. I'm talking about people trying to use regular expressions who have no idea what regular expressions do. And I don't mean that you can't decipher a regex pattern by looking at it. I mean you don't understand the general concept that regular expressions match patterns in data.

    I've seen several posts over on the regexadvice.com forums in the past few months which go something like “I have this problem and someone told me I should use a regular expression”. But unless that same someone bothered to explain what a regular expression does or why it is suited to your task, don't be in such a hurry to plug in a regex.

    I'm not saying you need to master regexes before using them, just get your head around the basics (matching). At the very least you should associate regexes with wildcard searches, just more powerful. If you are at least that far, proceed with trying to implement your regex. But if you are thinking something completely different, slow down there. Unless you have a basic understanding of what regular expressions do, even simple tasks become way more difficult than they should be. Even if you get someone to help you write a regex, if you still don't know what it's doing you are likely to keep trying to use them incorrectly.


    A common result of this lack of understanding that I've been seeing is trying to perform a task solely by writing a regex pattern when regexes themselves have no capacity to perform it. Not that a regex couldn't be a part of the solution, but in the end it's not what will do what the person wants done. Usually it's a coding task, and the regex can at most do some of the prep work. Even if some implementations of the regex engine allow code to be called, it is still the code doing the heavy lifting. The regex is only finding the data. People could save themselves hours of wasted time if they just understood the basics of what they are getting themselves into.


    Another thing I've seen, related to not having a basic understanding, is a task that is not only perfectly suited to a regex but is a very common application of one, and the person asking the question asks “Is this something that a regex can do?” Of the issues I've mentioned this one is the most and the least understandable. The most because regular expressions are not simple to everyone; to some it comes easy, to others it never comes completely. And some of the documentation isn't the greatest, though there is plenty of good documentation out there these days. So I can understand why some can't get their head around exactly how to write a regex. But it is the least understandable when they are asking how to write a regex so common, and usually so simple, that dozens of tutorials and articles on regexes use the very regex they are asking for as an example.


    I think the main problems of people who fall into this category are:


    1. Lack of research. There are more than enough tutorials out on the web, and more than a few books, that have simple samples to give you an idea of what a regex does. I find it very hard to believe when someone says they couldn't find a regex for basic (US) zip codes. Did you even look? Or are you only using a regex for this task because someone told you that you should, and you ran out to find someone to write it for you? Or are you just cheating on your homework?

    2. Misunderstanding what regular expressions understand. There seem to be more than a few people who think a regex understands the context of the data it matches. That it not only knows how to match the data but knows what it means, either in the context of their application or to the world at large. Those are the people that think a regex for matching zip codes knows what zip codes are used for and where they would be used. Sorry, but that's not the case. Regexes understand nothing of what your data means to you. As far as the regex engine is concerned it's just a string of characters. It's up to the regex author to know the context of the data they want to match and to shape the regex accordingly to return only the relevant data. This may make it seem like the regex understands the data it's matching, but that's not what is happening. What happened is that the person who wrote the regex understood the data and the problem set so well they were able to construct a regex that only matches the relevant values.

    3. Thinking that regex is a full-blown programming language. It's not. I've seen questions about wanting a regex to compare numbers, tell time or do some other function completely outside its realm, something most programming languages either have a feature to deal with or let you write code for. I've never seen any regex documentation promoting such features, so I can't imagine why someone thinks a regex can perform these tasks, other than my previous point, where they saw the results of a well-written regex and speculated on what and how much work the regex did. Like I mentioned, some implementations allow you to perform function calls, but that is more an add-on of the programming environment you are using than a generic regex feature. Realize that regular expressions are one of many features of your programming language, not the other way around.


    If you've read this far and you didn't know what regular expressions did or didn't do before, hopefully by now you have some idea. And if you already knew what they did, stop to consider this the next time you advise someone to use a regex: make sure you get across the high-level point that you are “matching something (a character pattern) in a string” before you get into the more complex aspects of what a regex can do.

  • You've got your sub-matches in my matches

    Hello boys and girls. Wow, it's been a while since I've done this. I want to touch on a very useful but often overlooked feature of regex: grouping. While I haven't been blogging I have been active on a message board here or there. A question I see quite often is “I want to find a match in a string but I don't want part of the match” or “I need the value of this portion of the string”. Now often I see solutions to these types of questions that involve look-arounds. And while they certainly work, they aren't the only way to achieve the desired results. New regex users seem to believe they can only access the full match. Most regex engines support groups, which let you access a certain portion of the full match. Groups are identified by parentheses. Every pair of plain parentheses is a group. For example, take the regex pattern


    /(Hello)\x20(world)/


    There are two groups in the regex. The regex itself matches the string “Hello world”, group 1 contains the string “Hello”, and group 2 contains the string “world”. Note that neither group contains the space between the two words, but it is part of the full match. Most implementations of regular expressions give you a way to access these groups. They are contained in a collection inside the match object. Now you'll need to consult your regex documentation to know the exact layout, but in most of these implementations there is a zero-based collection where item 0 is the full match and items 1..n are the groups your regex contains, if any.
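    As an illustration of that layout, here is how the groups look in Python's re module (my choice of implementation; the input sentence is made up). Python exposes the collection through match.group(n) rather than an indexed property, but the numbering is the same.

```python
import re

match = re.search(r"(Hello)\x20(world)", "Say Hello world to everyone")

full = match.group(0)    # item 0 is the full match, space included
first = match.group(1)   # group 1
second = match.group(2)  # group 2

print(repr(full), repr(first), repr(second))
```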


    Now if you notice, I said every pair of plain parentheses is a group. The reason I stress plain is because there are other grouping constructs, the aforementioned look-arounds being some. Now support for the other constructs varies between implementations, so again consult your regex documentation to see which ones you have. The other grouping constructs consist of an open parenthesis immediately followed by a question mark, which is then followed by the characters that define that particular grouping construct. Again, consult your documentation to see which characters define what. I'm not going to go into all of them here, but they basically fall into two categories: capturing and non-capturing. The plain parentheses I've mentioned above are a capturing group. However, capturing requires extra resources, and in some cases you need the extra speed but you still need to group a certain part of the pattern together for either necessity or readability or both. This is where you'll want to use a non-capturing group, (?:pattern): an open parenthesis followed immediately by a question mark, followed immediately by a colon. The difference here is that the data matched in the group is not added to the collection of sub-matches in the match object.

    Taking our previous example and making the first group non-capturing

    /(?:Hello)\x20(world)/


    Where before we had two groups, here we only have one. Group 1 contains the string “world”. Now this example is not a very practical use of a non-capturing group. Typically you'd use them in more complex regexes that need grouping but where you really don't care about the sub-matches.


    One more quick thing about capturing groups: basically, the position of each left parenthesis determines the index of the sub-match in the groups collection. So if you have nested parentheses, count every (plain) left parenthesis from left to right to know which index to use to reference a group. Some of the advanced grouping constructs and regex options can affect the ordering, but if you are using them hopefully you've read up on their effects, so I won't go over that here.


    /((Hello)|(Goodbye Cruel))\x20(world)/


    The above regex has 4 capturing groups (not counting group 0). Can you find them? It should match either the string “Hello world” or “Goodbye Cruel world”. Now I want to point out that not all the groups will participate in the match, but they are still part of the groups collection. There will always be 4 groups, just one will always be empty. Which one depends on which string was matched.

    If “Hello world” was matched the groups are

    1. Hello

    2. Hello

    3. (empty)

    4. world


    If “Goodbye Cruel world” was matched the groups are

    1. Goodbye Cruel

    2. (empty)

    3. Goodbye Cruel

    4. world


    In both cases group 0 would be the full match.


    Note that in both cases two groups contain the same value. Even if you need to know whether “Hello” or “Goodbye Cruel” was matched, you certainly don't need to know it twice. Plus the inner parentheses have different indexes, so you'd have to check both if you want to use those. This is where you'd use non-capturing groups to simplify your groups collection.


    /((?:Hello)|(?:Goodbye Cruel))\x20(world)/


    Now we are back down to two groups. Group 1 contains either “Hello” or “Goodbye Cruel” depending on which string was matched. Group 2 always contains “world”
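    The two versions can be compared side by side in Python (used here only as an example implementation; the helper name groups is made up). One difference to be aware of: Python reports a group that did not participate as None rather than as an empty string.

```python
import re

capturing = re.compile(r"((Hello)|(Goodbye Cruel))\x20(world)")
non_capturing = re.compile(r"((?:Hello)|(?:Goodbye Cruel))\x20(world)")

def groups(pattern, text):
    """Hypothetical helper: return groups 1..n for a full match of text."""
    return pattern.fullmatch(text).groups()

for text in ("Hello world", "Goodbye Cruel world"):
    print(groups(capturing, text))
    print(groups(non_capturing, text))
```

    The capturing version always reports four groups, with either group 2 or group 3 unset; the non-capturing version reports just the two you actually care about.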


    However, keep in mind that in some cases you'll want to use the inner index to determine which alternative was matched. So using non-capturing groups isn't necessarily better; it just depends on whether you need to access those groups or not. But if you are not doing anything with them, don't capture them.



    These are just two of the basic grouping constructs, and they are generally supported across implementations of regex, but not always. But if they are, you can use them to easily dissect larger matches.



  • Named Groups to the Rescue

    I was asked to modify some text that had been built incorrectly. Basically, insert some text at a certain point. First I use a regex to find the text, then insert the new value within that match. Now since the inserted value goes inside the matched text, I simply wanted to use backreferences and the replace method. Simple, right? Well, not so much.

    Now the text is in a field of various rows of a database table, and the text to be inserted comes from another field in the same row and is an alphanumeric value. The inserted text is dynamic, so I can’t simply hard-code the replacement text; it has to be built for each row. The text to be modified is a certain attribute somewhere in the text. For this example let's say it’s “id=xyz”, which is constant for all records. The new text will be inserted right after the equals sign.

    So for

    source  = “{some stuff} id=xyz {more stuff}”
    newText = “ab1”

    you get

    “{some stuff} id=ab1xyz {more stuff}”

     

    Simple enough. You use the regex \bid=xyz\b to match the text, then split it into groups so you can use backreferences in the replace. So your final regex looks like this:

     \b(id=)(xyz)\b

    Now group 1 contains the text up to your insertion point (id=)

    and group 2 contains the text after your insertion point (xyz).

    So the replacement string for your regex is “$1(new data goes here)$2”, where (new data goes here) = some alphanumeric value pulled from a second field in the row.

    Doing this in .Net, my code looked something like this pseudo-code:

    Regex regexFind = new Regex(@"\b(id=)(xyz)\b");

    Get Records

    For each row

                 fieldA = rowFieldA  (source text)

                 fieldB = rowFieldB  (insert value)

                 fieldA = regexFind.Replace(fieldA, String.Format("$1{0}$2", fieldB))

    next

    The Format method of the string creates a replacement string for each row.

    Looks good? It works fine for our example, but there is a problem.

    For our example value of “ab1” the string format produces “$1ab1$2”, which is exactly what we want. But as this field is alphanumeric it could begin with a number, which causes a problem. Say for the next record the value of the text to be inserted is “12a”. The format method produces a replacement string of “$112a$2”, which is not good. Syntactically it’s fine, but it’s not what we want, because instead of inserting some text between group 1 and group 2, it is trying to insert text between group 112 and group 2. As there is no group 112, it treats $112 as literal text, so the matched text is replaced with “$112axyz” and the “id=” is lost as well.

    OK, this is where named groups become handy (necessary?). If you use named groups in your regex and replacement string you can avoid this problem.

    Change the regex to \b(?<att>id=)(?<val>xyz)\b

    and your replacement string to “${att}(new data goes here)${val}”

    Now if you are using the string format method there is one more hoop you have to jump through. Because the regex engine and the format method both use curly braces there is a conflict, and the format method will complain, so you have to write it like this

    String.Format("${0}att{1}{2}${0}val{1}","{","}",newValue)

    To get the desired replacement string.
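    The same trap and the same cure exist outside .Net. As a hedged sketch, here is the named-group version in Python, where a named group is written (?P<name>...) and referenced as \g<name> in the replacement. The function name insert_value is hypothetical, and the value is assumed to be alphanumeric as in the post, so it contains no backslashes for the replacement template to interpret.

```python
import re

# Python spelling of \b(?<att>id=)(?<val>xyz)\b
FIND = re.compile(r"\b(?P<att>id=)(?P<val>xyz)\b")

def insert_value(source, new_value):
    """Hypothetical helper: insert new_value right after 'id='."""
    # \g<att> is closed by '>', so it cannot run into leading digits of
    # new_value the way a bare numbered reference could.
    return FIND.sub(r"\g<att>" + new_value + r"\g<val>", source)

print(insert_value("{some stuff} id=xyz {more stuff}", "ab1"))
print(insert_value("{some stuff} id=xyz {more stuff}", "12a"))
```

    Both the letter-first and the digit-first value land between the two groups as intended.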
     

    When I started writing this I thought this was the only way to get this to work, which would mean you could only solve this with a regex engine, like .Net, that supports named groups, but I’ve since thought of a second way.

    But whenever a group in your replacement string can be followed by a digit, you may want to consider using named groups to avoid unexpected surprises.

     


  • Making your regex code ready.

    There are times when a regular expression you’ve written, or someone has written for you, needs a little tweaking before you add it to your code, and the tweaking is required because the syntax of the language conflicts with your regex. For example, part of your regex pattern contains a double quote and the language you are using uses double quotes as string delimiters. If you just cut and paste the pattern into your code, the pattern’s quotation mark will terminate your string prematurely. Now the usual way to fix it in code is to escape the quotation mark in the pattern. This solution requires altering the regex, and how the character is escaped depends on the language being used. The regex itself allows you to escape characters with the \ character. The language being used may or may not recognize that as an escape character in its own syntax. And it may be confusing later when you look at the regex and can’t remember why you escaped a character when the pattern itself doesn’t need it. But there is another way: hex values.

                                                                                                                                 

    Most regex implementations support a hex syntax \x##, where each # is a hex digit.

    So if you use \x22 instead of a double quote and \x27 instead of a single quote, the regexes become more cookie-cutter ready.
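    For instance, the patterns below can be pasted between either kind of string delimiter without any extra escaping. Python is used here just as an example host language, and the sample inputs are made up.

```python
import re

# \x22 stands in for the double quote, so the pattern sits happily
# inside a double-quoted string literal.
DOUBLE_QUOTED = re.compile(r"\x22([^\x22]*)\x22")

# \x27 stands in for the single quote inside a single-quoted literal.
SINGLE_QUOTED = re.compile(r'\x27([^\x27]*)\x27')

double_values = DOUBLE_QUOTED.findall('name="alpha" title="beta"')
single_values = SINGLE_QUOTED.findall("say 'hi' and 'bye'")

print(double_values)  # ['alpha', 'beta']
print(single_values)  # ['hi', 'bye']
```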

     

    Another useful hex value is \x20, which is a space. This is especially useful in .Net, where there is a regex option to ignore white space in the pattern (IgnorePatternWhitespace). Turning this option on allows end-of-line comments in the regex but, with the exception of inside a character class, ignores typed-in spaces within the pattern, which would be problematic if a space was part of the pattern to match. So you could break a working regex if you later decide to add this option. This happened on the Regexlib when the option was first turned on. A lot of patterns that were written before the switch was flipped suddenly stopped working.
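    Python has a comparable option, re.VERBOSE, which makes it easy to reproduce the breakage. This is only an analogy to the .Net option, sketched under the assumption that the two behave alike for these simple patterns.

```python
import re

# With the option on, the typed space is dropped from the pattern.
plain = re.compile(r"Hello world", re.VERBOSE)

# \x20 survives the option, and a trailing comment is now allowed.
hexed = re.compile(r"Hello\x20world  # space written as a hex value", re.VERBOSE)

# A space inside a character class is also kept.
classed = re.compile(r"Hello[ ]world", re.VERBOSE)

print(bool(plain.fullmatch("Hello world")))    # False
print(bool(plain.fullmatch("Helloworld")))     # True
print(bool(hexed.fullmatch("Hello world")))    # True
print(bool(classed.fullmatch("Hello world")))  # True
```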

     

    Speaking of .Net, when it comes to named groups you can’t use the hex notation in place of the quotes that define the group name in the single-quote syntax, (?'name'...). However, you can avoid any issue with single quotes by using the alternate syntax, (?<name>...).

  • Word Break.

    The list-detail design is very commonly used with web pages, where you have a list of links that lead to more detailed information on each entry. Sometimes the text of the list is simply a snippet of a much longer string of text in the detail. A common way to handle this is to use a string function to return the first n characters of the string and display that in the list. The problem with this is that it tends to make the break right in the middle of a word. That isn’t a major problem, but it can be aesthetically displeasing or may accidentally form another word you didn’t mean to put on your site. When facing this issue I came up with a simple regex to allow me to break on whole words.

    ^(?:[ -~]{n,m}(?:$|(?:[\w!?.])\s))

    Where n = the minimum number of characters to match and m = the maximum number of characters to allow in the match. (Strictly, the match can run up to two characters past m, since the final word character and the white space after it are matched outside the {n,m} part.) Now in this instance I’m considering a word to be one or more ASCII non-white-space characters. The way it works is that after matching n ASCII characters it tries to match either the end of the string or a word character or sentence-ending punctuation followed by a white space. So it will accept as many characters, including white space, as it can up to m and still satisfy the rest of the match. Otherwise it backtracks until the regex is satisfied.

    So if you wanted a minimum of 2 characters and a maximum of 75 the regex would be

    ^(?:[ -~]{2,75}(?:$|(?:[\w!?.])\s))

    and if you applied it to the Gettysburg Address

    “Four score and seven years ago, our fathers brought forth upon this continent a new nation: conceived in liberty, and dedicated to the proposition that all men are created equal. …”

    (only the 1st paragraph is shown for the example, but you could apply the full text), taking the first match you get

    “Four score and seven years ago, our fathers brought forth upon this ”

    There are a few problems with the regex that can be improved. First off, it only accepts basic ASCII displayable characters, decimal 32 to 126, which means the text must be in that range. I did it this way because it gives you the US alphabet, digits and commonly used symbols and punctuation, which was all I needed at the time. Other characters would need to be added. Also, if the first word's character count exceeds your maximum length, no match will be found. You can make this regex a little more dynamic by putting it inside a function that takes your string and the max and min values as input.
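    Such a function might look like the sketch below. It is written in Python rather than whatever language the site uses, and the name snippet and its parameter names are made up for the example. One Python-specific caveat: \w is Unicode-aware there, which makes no difference here because [ -~] already limits the text to ASCII.

```python
import re

def snippet(text, min_chars, max_chars):
    """Hypothetical helper: cut text on a word break, or return None if no break fits."""
    # Drop the caller's limits into the {n,m} quantifier of the pattern.
    pattern = r"^(?:[ -~]{%d,%d}(?:$|(?:[\w!?.])\s))" % (min_chars, max_chars)
    match = re.match(pattern, text)
    return match.group(0) if match else None

address = (
    "Four score and seven years ago, our fathers brought forth upon this "
    "continent a new nation: conceived in liberty, and dedicated to the "
    "proposition that all men are created equal."
)

print(repr(snippet(address, 2, 75)))
print(repr(snippet("Supercalifragilistic word", 2, 5)))
```

    The first call returns the text up to and including the space after “this”; the second returns None because the first word is longer than the maximum, which is the limitation mentioned above.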