
Michael Ash's Regex Blog

Regex Musings

Got (X)HTML? Use the DOM



By far the most common request I see from people wanting regex help is for a regex to parse HTML. Generally I ignore those questions. If I do respond, my response is “Don’t use a regex to parse HTML. Use the HTML DOM.” The same is true for XML, with the XML DOM, but doing it for HTML is even worse. Not that my advice stops people from trying anyway, but having given them proper warning, I feel no remorse over the self-inflicted pain they bring on by ignoring it.

You may ask why I advise people against using a regex for this task. First, full disclosure: I have used, do use, and will use regexes to parse HTML. But this is a case of “Do as I say, not as I do.” Why is it OK for me but not for you, you may ask? Simple: I am very knowledgeable in both regex and HTML. If I do embark on such a perilous task, I know exactly what I’m getting into, what the risks are and what pitfalls to expect. In most cases I am the author of the HTML, so I know how the markup will be constructed. Most people asking for a regex to parse HTML are expert in neither and have no idea what to prepare for. That is reason one I tell people not to use regex for HTML: it is not as simple as you want it to be. Most people knowledgeable in regex will tell you to never use a regex for HTML. I won’t go that far, but if someone needs help to construct such a regex then they shouldn’t be using a regex for this task. If you can write it yourself, you still shouldn’t do it, but you are the only one who has to deal with the pitfalls, so you are the only one who has to suffer.

Now, I’ve seen other articles advising against using regex on HTML, so I never really considered writing about it. I thought it had been covered enough. But since people keep asking for these types of patterns, I thought I should address it as well. I also have insight that most of those other articles don’t address.

The biggest issue by far is that the people asking for these patterns, as well as most of those who try to write patterns for them, don’t know HTML very well. HTML is one of those languages a lot of people think they know better than they actually do. And that’s where the problems start. For the last few years I've been learning about Web Standards, which in a nutshell includes writing valid HTML and following best practices of web development. One of the things I learned in doing this is that most HTML in the real world is poorly written. This compounds the problem of using regex against HTML, since regexes are all about finding patterns. So even if you know the rules of valid HTML, you have no guarantee the HTML you are working with follows them.

One of the caveats of the Posting Guidelines is to show the original text and not a made-up sample. Although you shouldn’t be trying to use a regex on HTML, it illustrates why this guideline is in place. The majority of people asking this type of question will either make up the sample HTML or show a simplified version of the markup, highlighting the beginning and the end of what they want to match and writing something like “blah blah blah” for the content in between. The problem here is a lack of understanding of how regex works and how HTML can be written.

Commonly the request is to match a certain div or table. The problem here is they assume that these elements won’t contain child elements of the same type. A table can contain other tables, which is very common in pages that use tables for layout. And just about any page that has div elements will contain nested divs.
Often the request is “I want to match a particular div” and they have some phony HTML like

<div id="findMe">
    blah

blah

blah

</div>

And some kind soul may write them a regex pattern that matches that sample. The problem is that real HTML could break any pattern they write. The person asking for help incorrectly assumes that a regex will understand HTML and know that the </div> is the closing tag of <div id="findMe">. The person answering may think that all that is needed, after finding the opening div, is to search until the closing div occurs. This shows how made-up data and regex construction don’t go together. The pattern to find a block with no nested elements is simple. The pattern to find one that has nested elements gets significantly harder the deeper the nesting. The “blah blah blah” could easily be more markup, including any number of divs. There could easily be eight </div> tags before the one that closes <div id="findMe">.
If the actual source is

<div id="findMe">
    <div>content <div> More content</div> continued </div>

<div>content</div>

<div>content More content</div>

</div>
Most regexes will choke trying to correctly find findMe’s closing tag, especially if the nesting depth is unknown.
While a powerful regex engine may be able to balance the tags, that doesn’t mean you are in the clear.
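
To make the failure concrete, here is a minimal JavaScript sketch. The markup is a made-up fragment modeled on the nested sample (the trailing unrelated div and the variable names are my own additions for illustration). It tries the two naive patterns people usually reach for: the lazy one stops at the first </div> it meets, and the greedy one runs on to the last </div> in the input.

```javascript
// Hypothetical markup modeled on the nested sample; the last div is unrelated.
const html = [
  '<div id="findMe">',
  '    <div>content <div> More content</div> continued </div>',
  '    <div>content</div>',
  '</div>',
  '<div>unrelated</div>',
].join('\n');

// Lazy: opening tag, then as little as possible up to the first </div>.
const lazyMatch = html.match(/<div id="findMe">[\s\S]*?<\/div>/)[0];

// Greedy: opening tag, then as much as possible up to the last </div>.
const greedyMatch = html.match(/<div id="findMe">[\s\S]*<\/div>/)[0];

// The lazy match ends inside the innermost nested div, long before findMe closes.
console.log(lazyMatch);

// The greedy match swallows the unrelated div that follows findMe.
console.log(greedyMatch);
```

Neither result is the findMe element. One is too short and the other is too long.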

Of course the other assumption that leads to trouble is that the HTML is either well written or well formed. Most HTML out in the wild is not well written, in that it wouldn’t pass validation. Run most pages through the W3C’s validator (http://validator.w3.org/) and don’t be shocked if the error count is in triple digits. Unlike a web browser, a regex isn’t an HTML parser, so it isn’t going to catch any HTML errors, any of which may cause your pattern to fail. A web browser is very fault tolerant, so it can handle errors and adjust. A regex won’t.
<div id="findMe">
    blah

blah

blah

<div>

Oops forgot to close the previous Div

</div>

So even if you managed to write a pattern that could handle nested tags of the same type, which is tough enough by itself, you now have to account for unclosed tags as well. And that is just one of many very likely errors you’d need to handle.

Even HTML that is well written and passes validation isn’t necessarily well formed. Unlike XML or XHTML, HTML doesn’t have to be well formed to be valid. The assumption by many is that every opening tag has a matching closing tag. Sorry, that simply isn’t the case. Several elements have optional closing tags. While the above snippet isn’t valid, the snippet below is:

<p id="findMe">
    blah

blah

<p> More stuff</p>

Some may think they can just test for the first occurrence of either </p> or <p> after <p id="findMe"> to find the complete <p id="findMe"> element, but what if the markup looks like this?

<p id="findMe">
    <span>blah</span>

blah

<ul>
    <li>
    <li><p> More Stuff</p>
</ul>
<p> More stuff</p>

or for a nested table

<table id="findMe">
    <tbody>
        <tr><td><td><table>
                    <tr><td><td>

<tr><td><td></tr>

                     </table>
        <tr><td><td>
        <tr><td><td>
</table>

The samples above represent source you could run into when trying to find the element with the id of “findMe”.
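
As a rough JavaScript sketch of that “first </p> or <p>” rule, take a fragment modeled on the paragraph sample (the string and the variable names are mine, for illustration only). As I read the HTML rules, a parser ends the open paragraph where the <ul> begins, since a list can’t live inside a p. The naive pattern doesn’t know that and keeps going until it meets the <p> inside the list item.

```javascript
// Hypothetical markup modeled on the paragraph sample, with straight quotes.
const html = [
  '<p id="findMe">',
  '    <span>blah</span>',
  'blah',
  '<ul>',
  '    <li>',
  '    <li><p> More Stuff</p>',
  '</ul>',
  '<p> More stuff</p>',
].join('\n');

// Naive rule: the paragraph runs until the next </p> or the next <p> start tag.
const naive = /<p id="findMe">[\s\S]*?(?=<\/p>|<p[\s>])/;
const matched = html.match(naive)[0];

// Assumption: an HTML parser closes the paragraph where <ul> starts.
// The match begins at index 0, so positions in html and matched line up.
const realEnd = html.indexOf('<ul>');

// Everything the regex grabbed that is not part of the paragraph.
const overshoot = matched.slice(realEnd);
console.log(JSON.stringify(overshoot));
```

The overshoot is the start of the list, which the pattern has wrongly folded into the paragraph.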

Formatting also has to be considered. What is the difference between these two snippets of markup?
<p class=demo id=test> Some Text</p>
and
<p
 id="test"
 class="demo"
>
Some Text
</p>

As far as a browser is concerned they are the same. A regex would see them as two different strings.
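
Here is a small JavaScript sketch of that difference. The two strings reproduce the snippets above with straight quotes. The patterns and variable names are my own illustration, not something anyone should ship.

```javascript
// The two layouts from the text: compact and unquoted vs. spread out and quoted.
const compact = '<p class=demo id=test> Some Text</p>';
const spread = ['<p', ' id="test"', ' class="demo"', '>', 'Some Text', '</p>'].join('\n');

// A pattern written with one layout in mind: id first, double quotes, single spaces.
const strict = /<p id="test" class="demo">/;
const strictOnCompact = strict.test(compact);
const strictOnSpread = strict.test(spread);

// A more tolerant attempt: any whitespace, optional quotes, id anywhere in the tag.
const tolerant = /<p\b[^>]*\bid\s*=\s*["']?test\b/;
const tolerantOnCompact = tolerant.test(compact);
const tolerantOnSpread = tolerant.test(spread);

// The tolerant pattern now also accepts a tag that only has a data-id attribute.
const falsePositive = tolerant.test('<p data-id=test>');

console.log({ strictOnCompact, strictOnSpread, tolerantOnCompact, tolerantOnSpread, falsePositive });
```

The strict pattern misses both layouts. The tolerant one catches both but is harder to read, and it already lets in a tag it shouldn’t.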

Another common thing I see is that when people want to use a regex against HTML, they try to find an element based on an attribute. Several incorrect assumptions are generally made:
  1. The attribute they are looking for will always be the first.
  2. The order of the attributes is fixed.
  3. The attributes are always quoted.
  4. Every attribute is assigned a value.


None of the above has to be true for valid HTML.
Take the following and assume one wanted to find all the elements with findMe in the class attribute:

<p class="findMe">
    blah

blah

<p> More stuff</p>
<p class="hello findMe world">
    blah
    <span class='findMe also'>blah</span>
<p> More stuff</p>
<p class=findMe>
    blah

blah

</p>
<p class="can you findMe"> More stuff</p>
<p> More stuff</p>

Trying to find each element with findMe in the class with a regex is way more trouble than it’s worth.
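
To put numbers on it, here is a JavaScript sketch run against a condensed copy of that sample (straight quotes, “blah” lines folded together). Both patterns and all the names are mine, made up for illustration. Five elements carry the findMe class.

```javascript
// Hypothetical markup modeled on the class sample above.
const html = [
  '<p class="findMe">blah',
  '<p> More stuff</p>',
  '<p class="hello findMe world">blah',
  "    <span class='findMe also'>blah</span>",
  '<p> More stuff</p>',
  '<p class=findMe>blah</p>',
  '<p class="can you findMe"> More stuff</p>',
  '<p> More stuff</p>',
].join('\n');

// The pattern most people write first: one quoting style, one exact value.
const naiveHits = html.match(/class="findMe"/g) || [];

// A looser attempt: double, single or no quotes, findMe anywhere in the value.
const loose = /class\s*=\s*(?:"[^"]*\bfindMe\b[^"]*"|'[^']*\bfindMe\b[^']*'|findMe(?=[\s>]))/g;
const looseHits = html.match(loose) || [];

// The looser pattern treats "-" as a word boundary, so it also accepts
// a class name that merely contains findMe.
const falsePositive = '<p class="dont-findMe-here">'.match(loose);

console.log(naiveHits.length, looseHits.length, falsePositive !== null);
```

The first pattern finds one of the five. The second finds all five, at the cost of readability, and it still flags a class it shouldn’t. It also only locates the attribute, not the element it belongs to.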

Now, these examples are contrived, and you most likely won’t get such a mix of styles. But you could get one that you weren’t expecting. And if your HTML was created by a team of developers, you may get that mix after all. And remember, you can’t even assume the HTML is correctly written. So the moment you notice a style you didn’t account for, you have to adjust your pattern, which could be a huge nightmare.

Another thing to consider, especially when searching by attributes, is that a regex can’t reuse consumed text. Consider the following:
<div class="findMe">
    <p>Blah</p>
    <p class="findMe"> Some text <b class="findMe">to find</b></p>
    <p>Blah</p>
</div>
<div> Blah Blah Blah</div>

Again, the task is to find all elements with a findMe class. Let’s assume a regex can match the div element. Once this is done, the regex engine will proceed to look for the next match after the div element’s closing tag. It will not go back inside the div to find the class on the elements within it.
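
A quick JavaScript sketch of the consumed-text problem, using the sample above with straight quotes. The pattern and names are hypothetical, and the whole-element pattern only works here because this sample has no nested tags of the same name.

```javascript
// The sample from the text, with straight quotes.
const html = [
  '<div class="findMe">',
  '    <p>Blah</p>',
  '    <p class="findMe"> Some text <b class="findMe">to find</b></p>',
  '    <p>Blah</p>',
  '</div>',
  '<div> Blah Blah Blah</div>',
].join('\n');

// Match a whole element with class findMe: opening tag through its closing tag.
// The \1 backreference makes the closing tag name agree with the opening one.
const wholeElement = /<(\w+) class="findMe">[\s\S]*?<\/\1>/g;
const elements = html.match(wholeElement) || [];

// Counting just the opening tags shows how many elements carry the class.
const openTags = html.match(/<\w+ class="findMe">/g) || [];

console.log(elements.length, openTags.length);
```

Three elements carry the class, but the global search returns only the outer div. The p and the b sit inside text the first match already consumed, so the engine never looks at them.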


Instead of trying to hit a moving target with a regex, use the HTML DOM, which would parse all the above samples, even the incorrect one, because that's its job. Of course, if there are errors, the document may not be parsed exactly as you intended, but it will be correctly parsed. From there it is fairly easy to find the element you need. Reformatting the markup is very unlikely to break code using the DOM.

For example, using the popular JavaScript library jQuery, the following snippet would find all the elements with a findMe class.

var foundThem = $(".findMe");


Or, in browsers that support getElementsByClassName:

var foundThem = document.getElementsByClassName('findMe');


And quickly, on XML: with its stricter syntax rules (well formed, all attributes quoted) it may seem more regex friendly, but there are other ways to parse XML. XPath and XQuery are both more maintainable than a regex would be for this task, and easier to use and understand in most cases.

You can use a regex which may work, some of the time, or the DOM, something that should work all the time.  In the end the choice is yours.

Published Sunday, January 23, 2011 9:04 PM by mash