By far the most common request I see from people wanting regex help is
for a regex to parse HTML. Generally I ignore those questions. If I do
respond, my response is “Don’t use a regex to parse HTML. Use the HTML
DOM.” The same is true for XML, with the XML DOM, but doing it for HTML
is even worse. Not that my advice stops people from trying anyway, but
having given them proper warning, I feel no remorse over the
self-inflicted pain they incur by ignoring my advice.

You may ask why I advise people against using a regex for this task.
First, full disclosure: I have used, do use, and will use regexes to
parse HTML. But this is a case of “Do as I say, not as I do.” “Why is
it OK for me but not for you?” you may ask. Simple: I am very
knowledgeable in both regex and HTML. If I do embark on such a perilous
task, I know exactly what I’m getting into, what the risks are, and
what pitfalls to expect.
In most cases I am the author of the HTML, so I know how the markup will
be constructed. Most people asking for a regex to parse HTML are expert
in neither and have no idea what to prepare for. That is reason one I
tell people not to use regex for HTML: it is not as simple as you want
it to be. Most people knowledgeable in regex will tell you to never use
a regex for HTML. I won’t go that far, but if someone needs help
constructing such a regex, then they shouldn’t be using a regex for this
task. If you can write it yourself, you still shouldn’t do it, but you
are the only one who has to deal with the pitfalls, so you are the only
one who has to suffer.

Now, I’ve seen other articles advising against using regex on HTML, so I
never really considered writing about it. I thought it had been covered
enough. But since people keep asking for these types of patterns, I
thought I should address it as well. I also have insight that most of
those other articles don’t address.

The biggest issue by far is that the people asking for these patterns,
as well as most of those who try to write patterns for them, don’t know
HTML very well. HTML is one of those languages that a lot of people
think they know better than they actually do. And that’s where the
problems start. For the last few years I’ve been learning about Web
Standards, which in a nutshell means writing valid HTML and following
best practices of web development. One of the things I learned in doing
this is that most HTML in the real world is poorly written. This
compounds the problem of using regex against HTML, since regexes are all
about finding patterns. So even if you know the rules of valid HTML, you
have no guarantee that the HTML you are working with follows those
rules.

One
of the caveats of the Posting Guidelines is to show the original text
and not a made-up sample. Although you shouldn’t be trying to use a
regex on HTML, it illustrates why this guideline is in place. The
majority of people asking this type of question will either make up the
sample HTML or show a simplified version of the markup, highlighting the
beginning and the end of what they want to match and writing something
like “blah blah blah” for the content in between. The problem here is a
lack of understanding of how regex works and how HTML can be written.

Commonly the request is to match a certain div or table. The problem
here is that they assume these elements won’t be nested with child
elements of the same type. A table can contain other tables, which is
very common in pages that use tables for layout. And just about any page
that has div elements will contain nested divs.

Often the request is “I want to match a particular div,” and they have
some phony HTML like

<div id="findMe"> blahblah
blah
</div>

And some kind soul may write them a regex pattern that matches that
sample. The problem is that real HTML could break any pattern they
write. The person asking for help incorrectly assumes that a regex will
understand HTML and know that the </div> is the closing tag of
<div id="findMe">. The person answering may think that all that is
needed, after finding the opening div, is to search until the closing
div occurs. This shows how made-up data and regex construction don’t go
together.
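
To make that concrete, here is a minimal sketch of the "find the opening
div, then search until the closing div" idea. The helper name
naiveFindDiv and the nested sample string are made up for illustration,
and the id is dropped into the pattern without any escaping, so treat
this as a demonstration of the failure rather than something to reuse.

```javascript
// Hypothetical helper: match from the opening div with the given id
// to the nearest </div>, using a lazy "anything in between".
function naiveFindDiv(html, id) {
  const pattern = new RegExp('<div id="' + id + '">[\\s\\S]*?</div>');
  const match = html.match(pattern);
  return match ? match[0] : null;
}

// The made-up sample: no nested markup, so the pattern appears to work.
const sample = '<div id="findMe"> blahblah\nblah\n</div>';

// Closer to real markup: the target div contains another div.
const nested = '<div id="findMe"> <div>content</div> more content </div>';

console.log(naiveFindDiv(sample, 'findMe') === sample); // true
console.log(naiveFindDiv(nested, 'findMe'));
// <div id="findMe"> <div>content</div>
```

On the nested input the match stops at the inner div's closing tag, so
the result is neither the element that was asked for nor even balanced
markup.
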

The pattern to find a block with no nested elements is simple. The
pattern to find one that has nested elements gets significantly harder
the deeper the nesting goes. The “blah blah blah” could easily be more
markup, including any number of divs. There could easily be eight
</div> tags before the one that closes <div id="findMe">. If the actual
source is

<div id="findMe"> <div>content <div> More content</div> continued </div><div>content</div>
<div>content More content</div>
</div>

most regexes will choke trying to correctly find findMe’s closing tag,
especially if the nesting depth is unknown. While a powerful regex
engine may be able to balance the tags, that doesn’t mean you are in the
clear.

Of course, the other assumption that leads to trouble is that the HTML
is either well written or well formed. Most HTML out in the wild is not
well written, in that it wouldn’t pass validation. Run most pages
through the W3C’s validator (http://validator.w3.org/).
Don’t be shocked if the error count is in triple digits. Unlike a web
browser, a regex isn’t an HTML parser, so it isn’t going to catch any
HTML errors, any of which may cause your pattern to fail. A web browser
is very fault tolerant, so it can handle errors and adjust. A regex
won’t.

<div id="findMe"> blahblah
blah
<div> Oops forgot to close the previous Div
</div>

So even if you managed to write a pattern that could handle nested tags
of the same type, which is tough enough by itself, now you have to
account for unclosed tags as well, and that is just one of many likely
errors you’d need to account for.

Even
if the HTML is well written and passes validation, that doesn’t mean it
is well formed. Unlike XML or XHTML, HTML doesn’t have to be well formed
to be valid. The assumption by many is that every opening tag has a
matching closing tag. Sorry, that simply isn’t the case. Several
elements have optional closing tags. While the above snippet isn’t
valid, the snippet below is.

<p id="findMe"> blahblah
<p> More stuff</p>

Some may think they can just test for the first occurrence of either
</p> or <p> after <p id="findMe"> to find the complete <p id="findMe">
element, but what if the markup looks like this

<p id="findMe"> <span>blah</span>blah
<ul> <li> <li><p> More Stuff</p></ul><p> More stuff</p>

or, for a nested table, like this

<table id="findMe"> <tbody> <tr><td><td><table> <tr><td><td><tr><td><td></tr>
</table> <tr><td><td> <tr><td><td></table>

The above samples represent source that could exist when you attempt to
find the element with the id of “findMe”.

Formatting also has to be considered. What is the difference between
these two snippets of markup?

<p class=demo id=test> Some Text</p>

and

<p id="test" class="demo">Some Text</p>

As far as a browser is concerned they are the same. A regex would see
them as two different strings.

Another
common thing I see is that when people want to use a regex against HTML,
they try to find an element based on an attribute. Several incorrect
assumptions are generally made:
- The attribute they are looking for will always be first
- The order of the attributes is fixed
- The attributes are always quoted
- All attributes are assigned a value
None of the above has to be true for valid HTML.

Take the following and assume one wanted to find all the elements with
findMe in the class attribute:

<p class="findMe"> blahblah
<p> More stuff</p><p class="hello findMe world"> blah <span class='findMe also'>blah</span><p> More stuff</p><p class=findMe> blahblah
</p>
<p class="can you findMe"> More stuff</p>
<p> More stuff</p>

Trying to find each element with findMe in its class with a regex is way
more trouble than it’s worth.
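
As a rough illustration of how much trouble, the sketch below runs two
patterns over that sample (written with straight quotes). Both patterns
and the names exact, forgiving, and count are my own for this example.
The second one copes with this particular sample, but it is still only a
guess at what the markup might look like: it would, for instance, also
accept a class named findMe-not or an attribute named data-class, and it
knows nothing about comments or scripts.

```javascript
// The sample markup from the text, with straight quotes.
const html = `<p class="findMe"> blahblah
<p> More stuff</p><p class="hello findMe world"> blah <span class='findMe also'>blah</span><p> More stuff</p><p class=findMe> blahblah
</p>
<p class="can you findMe"> More stuff</p>
<p> More stuff</p>`;

// First attempt: assumes class is the only attribute, double quoted,
// and holds exactly one name.
const exact = /<\w+ class="findMe">/g;

// Second attempt: one alternative per way of writing the value.
const value = [
  '"[^"]*\\bfindMe\\b[^"]*"', // double quoted, findMe among other names
  "'[^']*\\bfindMe\\b[^']*'", // single quoted
  'findMe(?=[\\s>])',         // unquoted
].join('|');
const forgiving = new RegExp(
  '<\\w+\\b[^>]*\\bclass\\s*=\\s*(?:' + value + ')[^>]*>',
  'g'
);

// Count the opening tags a pattern finds in the sample.
const count = (re) => (html.match(re) || []).length;

console.log(count(exact));     // 1
console.log(count(forgiving)); // 5
```

The first pattern finds one of the five elements. The second finds all
five here, at the cost of a pattern that is already hard to read and
still easy to fool.
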

Now, while these examples are contrived and you most likely won’t get
such a mix of styles, you could get one that you weren’t expecting. And
if your HTML was created by a team of developers, you may get that mix
after all. And remember, you can’t even assume the HTML is correctly
written. So the moment you notice a style you didn’t account for, you
have to adjust your pattern, which could be a huge nightmare.

Another thing to consider, especially when searching by attributes, is
that a regex can’t reuse consumed text. Consider the following:

<div class="findMe">
<p>Blah</p>
<p class="findMe"> Some text <b class="findMe">to find</b></p>
<p>Blah</p>
</div>
<div> Blah Blah Blah</div>

Again, say the task is to find all elements with a findMe class, and
let’s assume a regex can match the div element. Once this is done, the
regex engine will proceed to look for the next match after the div
element’s closing tag. It will not go back inside the div to find the
class on the elements within the div.
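
A small sketch of that behaviour, using the sample above with straight
quotes. The pattern wholeElement is a hypothetical one of my own that
matches an opening tag carrying class="findMe" through to the nearest
closing tag of the same name; it only works here because the outer div
has no nested div.

```javascript
// The sample from the text, with straight quotes.
const html = [
  '<div class="findMe">',
  '<p>Blah</p>',
  '<p class="findMe"> Some text <b class="findMe">to find</b></p>',
  '<p>Blah</p>',
  '</div>',
  '<div> Blah Blah Blah</div>',
].join('\n');

// Opening tag with class="findMe", then everything up to the nearest
// closing tag with the same name (\1 refers back to the tag name).
const wholeElement = /<(\w+)[^>]*class="findMe"[^>]*>[\s\S]*?<\/\1>/g;

// Each match consumes its text, so the search resumes after </div>.
const elements = [...html.matchAll(wholeElement)].map((m) => m[1]);
console.log(elements); // [ 'div' ]

// Counting only opening tags shows how many elements there were to find.
const openingTags = html.match(/<\w+[^>]*class="findMe"[^>]*>/g) || [];
console.log(openingTags.length); // 3
```

Three elements carry the class, but matching whole elements reports only
the outer div, because the p and b elements sit inside text that the
first match already consumed.
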

Instead of trying to hit a moving target with a regex, use the HTML DOM.
It would parse all the above samples, even the incorrect one, because
that’s its job. Of course, if there are errors the document may not be
parsed exactly as you intended, but it will be correctly parsed. From
there it is fairly easy to find the element you need. Reformatting the
markup is very unlikely to break code using the DOM.

For example, using the popular JavaScript library jQuery, the following
snippet would find all the elements with a findMe class.
var foundThem = $(".findMe");
Or, in browsers that support getElementsByClassName:
var foundThem = document.getElementsByClassName('findMe');

And quickly, on XML: with its stricter syntax rules (well formed, all
attributes quoted) it may seem more regex friendly, but there are other
ways to parse XML. XPath and XQuery are both more maintainable than a
regex would be for this task, and easier to use and understand in most
cases.

You can use a regex, which may work some of the time, or the DOM, which
should work all the time. In the end the choice is yours.