
Michael Ash's Regex Blog

Regex Musings

Making Dynamic XHTML pages valid with a Regex

I have a website that I use primarily to expand and sharpen my web development skills. My latest effort in that regard has been writing valid markup. Although the site has some usable functions, I use it mainly for practice, so I don’t work on it full time. Instead I work on it in spurts, with different editors, applying new scripts and new ideas at different times, so the output is far from consistent. I went to a lot of effort to first make my HTML valid, then later convert that HTML to XHTML and make that valid.

However, because the content of most of the pages is dynamically generated, it didn’t occur to me right away when validating the XHTML that something in the content could make a page fail validation in certain situations. Case in point: ampersands. According to the W3C (http://www.w3.org/TR/xhtml1/#C_12) you should not have solitary ampersand characters in your markup. An ampersand declares the beginning of an entity reference. If you want to display the ampersand character you should use the &amp; entity reference instead of a bare (&). Keep in mind that most browsers let you get away with incorrect usage, so I didn’t have to do anything to get my page to render. But seeing as the whole point of writing valid markup is not to depend on the browser fixing my sloppy code, I didn’t want to just let this go.

The first time I encountered this problem on a page, the fix was pretty simple. I was working with ASP, so I simply wrapped ASP’s Server.HTMLEncode method around the variables containing the database-derived content. Simple. Problem solved. Ummm, not quite. The problem occurred on another page, inside a dropdown list. My first thought, to use the HTMLEncode method again, was a horrible failure. In my previous fix I was pulling one column of one row, let’s say First Name, out of a database and sticking it in a single variable for later use. Here, all the markup for the dropdown was being created in a separate function using ADO’s GetString method.
So the variable containing the dropdown contained the full XHTML markup (the select and option tags). HTMLEncode turns the less-than and greater-than signs of the tags into entities, which in turn basically displays your intended markup as source code on the page.
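
To make the failure concrete, here is a minimal sketch in Python rather than ASP. It assumes that Python's html.escape behaves closely enough to Server.HTMLEncode for illustration (the two are not identical), and the dropdown string is a made-up example, not the markup from my site.

```python
import html

# Hypothetical dropdown markup, standing in for the string built by GetString.
dropdown = '<select name="dept"><option>Research & Development</option></select>'

# html.escape is used here as a rough stand-in for Server.HTMLEncode
# (an assumption for illustration only). quote=False leaves the double
# quotes alone so the effect on the tags is easier to see.
encoded = html.escape(dropdown, quote=False)

print(encoded)
```

The ampersand is fixed, but every tag bracket is escaped as well, so the browser would show the select and option tags as text instead of rendering a dropdown.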

 

Again, the intent is to turn the solitary ampersands (&) into the &amp; entity, so I decided to use a regex to fix this problem. I modified this regex, http://www.regexlib.com/REDetails.aspx?regexp_id=626, which matches entities, into this:

                

 &(?!(?i:\#((x([\dA-F]){1,5})|(104857[0-5]|10485[0-6]\d|1048[0-4]\d\d|104[0-7]\d{3}|10[0-3]\d{4}|0?\d{1,6}))|([A-Za-z\d.]{2,31}));)

 

This matches ampersands that are not part of an entity. So using the replace pattern of “&amp;” I can replace just the solitary ampersands.
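
As a rough illustration of the replace, here is a sketch in Python (3.6 or later, which accepts the scoped (?i:...) group). The pattern is the one above, unchanged; the function name and the sample string are hypothetical and chosen only for the example.

```python
import re

# The pattern from the post, unchanged: an ampersand that is NOT followed
# by something that looks like a numeric or named entity reference.
SOLITARY_AMP = re.compile(
    r"&(?!(?i:\#((x([\dA-F]){1,5})|(104857[0-5]|10485[0-6]\d|1048[0-4]\d\d"
    r"|104[0-7]\d{3}|10[0-3]\d{4}|0?\d{1,6}))|([A-Za-z\d.]{2,31}));)"
)


def escape_solitary_ampersands(markup: str) -> str:
    """Replace only the ampersands that do not start an entity reference."""
    return SOLITARY_AMP.sub("&amp;", markup)


# Hypothetical sample mixing bare ampersands with named, decimal and hex entities.
sample = "<option>R&D &amp; Sales &#169; &#xA9; Q&A</option>"
fixed = escape_solitary_ampersands(sample)

print(fixed)
```

Only the ampersands in R&D and Q&A are rewritten; the three existing entities are left alone. Because &amp; is itself an entity, running the function on its own output changes nothing.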

 

Now some of you might think this is overkill, since I could use VBScript’s Replace function to get the result I want. And in this very specific case that is true. There were only a handful of ampersands to replace. However, this issue came up in other places where a simple replace would not be easily implemented. The truth of the matter is that I didn’t come up with this regex for this problem. It was originally designed for dealing with XML files that contained solitary ampersands AND entity references. The VB or VBScript Replace won’t work in that case. For example, take the following line:

 

<STATEMENT>(1 &lt; 2) & (1 &lt; 4) are both true </STATEMENT>

 

 

Using HTMLEncode you get

&lt;STATEMENT&gt;(1 &amp;lt; 2) &amp; (1 &amp;lt; 4) are both true &lt;/STATEMENT&gt;

 

Using the VB Replace function you get

<STATEMENT>(1 &amp;lt; 2) &amp; (1 &amp;lt; 4) are both true </STATEMENT>

 

Neither is what you want.

 

The regex replace allows your content to contain entities without messing them up, and you get the desired output:

 

<STATEMENT>(1 &lt; 2) &amp; (1 &lt; 4) are both true </STATEMENT>
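
The three results above can be reproduced side by side with a short Python sketch. It assumes html.escape as a stand-in for Server.HTMLEncode and str.replace as a stand-in for the VB Replace function; the variable names are hypothetical.

```python
import html
import re

# Same pattern as in the post: an ampersand not followed by an entity body.
SOLITARY_AMP = re.compile(
    r"&(?!(?i:\#((x([\dA-F]){1,5})|(104857[0-5]|10485[0-6]\d|1048[0-4]\d\d"
    r"|104[0-7]\d{3}|10[0-3]\d{4}|0?\d{1,6}))|([A-Za-z\d.]{2,31}));)"
)

line = "<STATEMENT>(1 &lt; 2) & (1 &lt; 4) are both true </STATEMENT>"

# Stand-in for Server.HTMLEncode: escapes the tags and every ampersand.
encoded_all = html.escape(line, quote=False)

# Stand-in for VB Replace: keeps the tags but double-encodes the entities.
replaced_all = line.replace("&", "&amp;")

# Regex replace: touches only the solitary ampersand.
regex_fixed = SOLITARY_AMP.sub("&amp;", line)

for label, result in [
    ("HTMLEncode-style", encoded_all),
    ("plain Replace", replaced_all),
    ("regex replace", regex_fixed),
]:
    print(f"{label}: {result}")
```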

Published Wednesday, January 26, 2005 4:47 PM by mash

Comments

 

mash said:

Reminds me of some HTML --> XHTML conversion RegExs that I keep meaning to post on my own blog >>makes mental note<<

Just wanted to helpfully point out that (somewhat ironically) you forgot to encode the ampersand in this sentence:

"So using the replace pattern of “&” I can replace just the solitary ampersand."

Presumably you meant "& amp;" (without the space)? ;)
January 27, 2005 7:30 PM
 

mash said:

That is what I meant, and in my defense I didn't completely forget, because I did write it up that way. Unfortunately, when I entered it in the blog editor it decoded all my entities. It took me 5 tries to get the example to look right, because every time I tried to edit it, it would decode all the entities again and I'd have to redo everything. And as you see, I kept missing one or two.
January 28, 2005 7:48 AM