
Regular expression to get sections of an HTML page

Last post 07-07-2009, 6:29 PM by Aussie Susan. 9 replies.
  •  07-06-2009, 2:10 PM 54765

    Regular expression to get sections of an HTML page

    PHP PCRE support (preg_match).

    Here's some sample data:

    <div class="content" >Hello. This is one result.</div> <br />

    <div class="content">In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.</div> <br />

    <div class="content">And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.</div> <div class="extra">And this result also has extra content.</div><br />

    <div class="content" id="test">And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.</div> </table> <br />

    What I'm trying to do is write a regex to extract all the content between <div class="content"> and its corresponding closing </div>. The idea is to access an existing page (using curl, for example), extract its content and re-display that content, possibly after modifying it (e.g. stripping HTML).

    The results for the above sample set would need to be:

    • Hello. This is one result. 
    • In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.
    • And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.
    • And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.

    A regex like %<div class="content"[^>]*>(.+?)</div>%, which is what I've tried, fails on the second and fourth results, giving:

    • In this result, we use <div class="big">more DIV tags
    • And this one uses <div class="big">DIV tags

    Basically the problem is that I can't predict what follows the closing </div>, so there is nothing for the regex to check against. It might be a <br />, it might be another <div> of some class, it might be the closing of a table tag. The entire goal is simply to retrieve all <div class="content"> sections, including any <div>s they may contain, but not anything outside the corresponding <div> nesting.
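
    To make the failure concrete, here is a minimal sketch that runs the lazy pattern against the second sample. The helper name lazy_capture is made up for illustration; preg_match is the only real API involved.

```php
<?php
// Hypothetical helper: returns what the lazy pattern captures, or null
// when the pattern does not match at all.
function lazy_capture(string $html): ?string
{
    // (.+?) stops at the FIRST </div> it can reach, so the closing tag
    // of a nested <div> ends the capture too early.
    $pattern = '%<div class="content"[^>]*>(.+?)</div>%';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$sample = '<div class="content">In this result, we use '
        . '<div class="big">more DIV tags</div> which makes the regex fail.</div> <br />';

echo lazy_capture($sample), "\n";
// prints: In this result, we use <div class="big">more DIV tags
```

    The capture ends at the inner </div> because the lazy quantifier has no notion of nesting depth.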

    Thanks

    -fm

     

  •  07-06-2009, 3:42 PM 54781 in reply to 54765

    Re: Regular expression to get sections of an HTML page

    There are non-regex methods to parse HTML content that you might consider, such as:

    <?php
    $html='<div class="content" >Hello. This is one result.</div> <br />
    <div class="content">In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.</div> <br />
    <div class="content">And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.</div> <div class="extra">And this result also has extra content.</div><br />
    <div class="content" id="test">And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.</div> <br />'
    ;
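    // Parse the HTML fragment into a DOM tree.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // Every <div> in document order, nested ones included.
    $nodeList = $doc->getElementsByTagName('div');
    foreach ($nodeList as $key => $node) {
            // nodeValue is the text content only; inner tags are dropped.
            if ($node->getAttribute("class") === "content") {
                    echo '<font color=red>'.$key.':'.$node->nodeValue.'</font><br>';
            } else {
                    echo $key.':'.$node->nodeValue.'<br>';
            }
    }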
    ?>
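
    Note that nodeValue returns only the text of each node. If the inner markup should be kept, as in the expected results of the original post, one possible variation is to serialize each child node instead. This is only a sketch: content_divs is a hypothetical helper name, and it assumes the class attribute is exactly "content".

```php
<?php
// Hypothetical helper: inner HTML of every <div> whose class attribute
// is exactly "content" (an assumption; multi-class values are ignored).
function content_divs(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally so imperfect markup
    // (such as a stray </table>) does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('class') !== 'content') {
            continue;
        }
        // Serialize the children one by one to rebuild the inner HTML.
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

print_r(content_divs(
    '<div class="content">a <div class="big">b</div> c</div><div class="extra">x</div>'
));
```

    The serializer may normalize the markup (for example, how empty elements are written), so the output is not guaranteed to be byte-for-byte identical to the source.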


  •  07-06-2009, 8:58 PM 54801 in reply to 54781

    Re: Regular expression to get sections of an HTML page

    Thanks, I'll mess around with that, but is there actually a way to do this with regex?

    I was reading around on here and someone else mentioned that there is a (complicated) regex that can do nested stuff like this. I'm just trying to understand if/how it can be done for future use, in case I ever encounter nesting in something that's not HTML (think source code, custom data files, etc.).

    Thanks

    -fm

     

  •  07-06-2009, 10:29 PM 54803 in reply to 54801

    Re: Regular expression to get sections of an HTML page

    (?xis)                                  (?# set pattern modifiers)
    (                                       (?# first pattern group)
        <div[^>]*+>                         (?# match the opening tag)
        (?:
            (?:(?!</?div[^>]*+>).)*+        (?# match the child content up to the next div tag; *+ is a possessive quantifier, so no saved states)
          |                                 (?# OR)
            (?1)                            (?# recurse; 1 points to the first pattern group)
        )*                                  (?# zero or more times)
        </div>                              (?# closing tag)
    )                                       (?# end first pattern group)
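
    For anyone wanting to try the pattern above from PHP, here is a rough sketch. The helper name outermost_divs and the ~ delimiters are illustrative choices; the pattern itself is the one above, with the (?# ...) comments written as # comments, which extended mode also accepts.

```php
<?php
// Hypothetical helper: returns every outermost <div>...</div> block.
function outermost_divs(string $html): array
{
    $pattern = '~(?xis)
        (                                 # group 1: one complete div element
            <div[^>]*+>                   # opening tag
            (?:
                (?:(?!</?div[^>]*+>).)*+  # anything up to the next div tag
              |
                (?1)                      # a nested div: recurse into group 1
            )*
            </div>                        # closing tag
        )~';
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

print_r(outermost_divs(
    '<div class="a">x <div>y</div> z</div> <br /> <div>w</div>'
));
```

    Nested divs are consumed by the recursive call, so they appear only as part of their enclosing match rather than as separate results.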
    http://portal-vreme.ro
  •  07-07-2009, 12:09 AM 54807 in reply to 54801

    Re: Regular expression to get sections of an HTML page

    Try:

    <div\b(<div\b(?1)</div>|(?!</?div).)*</div>

    The start and end look very much as you would expect. The middle group is actually recursive and is broken into two alternate parts with the second being the easiest to understand - it is really just the '.' operator that is protected from going over a "<div" or "</div>" tag.

    If a nested "<div\b" tag is found then the first alternate is used which performs the recursive call to match group #1 and then processes the trailing tag.

    In general, regexes cannot handle this type of problem, sometimes called the matching parentheses problem after one of its most common instances: checking a standard mathematical expression that contains parentheses. PCRE adds the '(?1)' style recursive reference as an extension, which returns the complete text from the outer occurrence. The .NET regex engine allows this in a different way: it can record the individual captures for each match group (in effect adding another layer under each match group) and lets you treat the captures as a stack, adding a new capture for each nested occurrence and popping one off when each finishes. The pattern syntax for each is quite different, and these are the only two regex variants that I know of that allow this.

    However, all of the versions of the HTML and XML DOMs I have used have this pretty much as a standard operation. 
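
    As a non-HTML illustration of the matching parentheses problem mentioned above, here is a small sketch using PCRE's whole-pattern recursion, (?R). The helper name balanced_groups is hypothetical.

```php
<?php
// Hypothetical helper: returns every outermost balanced (...) group.
function balanced_groups(string $text): array
{
    // \(        literal opening parenthesis
    // [^()]++   a run of non-parenthesis characters (possessive)
    // (?R)      recurse into the whole pattern for a nested group
    // \)        the matching closing parenthesis
    preg_match_all('~\((?:[^()]++|(?R))*\)~', $text, $m);
    return $m[0];
}

print_r(balanced_groups('f(a(b)c) + (d) + (unclosed'));
// finds "(a(b)c)" and "(d)"; the unclosed group is not matched
```

    The same shape (opener, then either plain content or a recursive call, then closer) carries over to other nested formats such as braces in source code.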

    Susan 

    Edit: I took so long to enter my posting (for a number of reasons!!!) that Killahbeez got in before me. The patterns are very similar, but I prefer to end the '<div' part of my patterns with '\b', which stops matches against text such as "<division". Also, he gives the modifiers to use, which I forgot! 

  •  07-07-2009, 1:05 AM 54816 in reply to 54807

    Re: Regular expression to get sections of an HTML page

    Susan, if you do not use a possessive quantifier, your regex will not work as you expect.

     <div\b(<div\b(?1)</div>|(?!</?div).)*+</div>

     OR

      <div\b(<div\b(?1)</div>|(?:(?!</?div).)*)*</div>


    http://portal-vreme.ro
  •  07-07-2009, 7:11 AM 54838 in reply to 54816

    Re: Regular expression to get sections of an HTML page

    Thanks for picking that up - I guess I didn't look closely enough at the output from Doug's regex tester.

    However, the following is even simpler and does seem to work in all of the test cases:

    <div\b((?!</?div\b).|(?R))*</div>

    with the 'singleline' [s] (and 'extended' [x]) options set.

    Susan 

  •  07-07-2009, 8:15 AM 54840 in reply to 54838

    Re: Regular expression to get sections of an HTML page

    This becomes practically identical to my regex, except for the \b, which is a good idea, and the fact that your first alternative does not grab everything it can in one go: it matches a single character and relies on the final * to repeat it, which creates more saved states.

    A better way is to use (?:(?!</?div\b).)++ as the first alternative.

    From Jeffrey Friedl, "Mastering Regular Expressions", 3rd edition (page 477)
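
    A quick sketch to check that the two forms find the same matches. The constant and helper names are made up for illustration, and the second pattern is one reading of the suggestion above (the possessive run substituted for the single-character alternative).

```php
<?php
// Hypothetical helper: all full matches of $pattern in $html.
function all_matches(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $m);
    return $m[0];
}

// One character per loop iteration (the simpler pattern as posted).
const PER_CHAR = '~<div\b((?!</?div\b).|(?R))*</div>~s';
// A whole run of non-div text per iteration, matched possessively.
const PER_RUN  = '~<div\b((?:(?!</?div\b).)++|(?R))*</div>~s';

$html = '<div class="content">a <div class="big">b</div> c</div> <br /><div>d</div>';

var_dump(all_matches(PER_CHAR, $html) === all_matches(PER_RUN, $html)); // bool(true)
```

    Both patterns return the same matches here; the difference lies in how much backtracking information the engine keeps along the way, which this sketch does not measure.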


    http://portal-vreme.ro
  •  07-07-2009, 9:54 AM 54856 in reply to 54765

    Re: Regular expression to get sections of an HTML page

    fmillion:

    What I'm trying to do is write a regex to extract all the content between <div class="content"> and its corresponding closing </div>.

    Isn't the parsing issue more complicated than simply locating all divs regardless of class?


  •  07-07-2009, 6:29 PM 54889 in reply to 54856

    Re: Regular expression to get sections of an HTML page

    Yes and no.

    Yes because I read the OP as wanting to start with a div tag that has a specific attribute and value (in which case the pattern Killahbeez and I have been throwing around would need to be extended - although the basic concept is right).

    No because, once you have found the starting tag, any nested div tags (regardless of their attributes and values) will have an ending tag that needs to be considered.

    However, I understood the OP recognised that the DOM approach was probably the way to go (and I think it is) but was interested in the approach that would be needed to use a regex pattern. Therefore I see this as more of a discussion on technique rather than the development of a specific pattern to solve the OP's original problem.
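
    For completeness, the extension mentioned above (starting from a div with a specific attribute, then balancing any nested divs) might look roughly like this. It is a sketch under assumptions: the helper name content_sections and the group name anydiv are invented, and the attribute is assumed to be written exactly as class="content".

```php
<?php
// Hypothetical helper: inner HTML of each <div class="content"> block.
// Assumes the attribute appears exactly as class="content" (double
// quotes, single class name); real markup may need a looser test.
function content_sections(string $html): array
{
    $pattern = '~
        <div\b[^>]*\sclass="content"[^>]*>          # the specific opening tag
        (                                            # group 1: inner HTML
            (?: (?:(?!</?div\b).)++ | (?&anydiv) )*
        )
        </div>                                       # its matching close
        (?(DEFINE)
            (?<anydiv>                               # any nested div, any attributes
                <div\b[^>]*>
                (?: (?:(?!</?div\b).)++ | (?&anydiv) )*
                </div>
            )
        )
    ~xs';
    preg_match_all($pattern, $html, $m);
    return $m[1];
}

$html = '<div class="content" >Hello. This is one result.</div> <br />' . "\n"
      . '<div class="content">In this result, we use <div class="big">more DIV tags</div>'
      . ' which makes the regex fail.</div> <br />';

print_r(content_sections($html));
```

    Only the outer tag is tied to the attribute; the recursive anydiv group accepts nested divs with any attributes, which is the "yes and no" above in pattern form. On large pages a pattern like this can run into PCRE's backtracking limits, which is one more reason to prefer the DOM for real work.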

    Susan 
