I'm using PHP's PCRE support (preg_match).
Here's some sample data:
<div class="content" >Hello. This is one result.</div> <br />
<div class="content">In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.</div> <br />
<div class="content">And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.</div> <div class="extra">And this result also has extra content.</div><br />
<div class="content" id="test">And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.</div> </table> <br />
What I'm trying to do is write a regex to extract all the content between <div class="content"> and its corresponding closing </div>. The idea is to access an existing page (using curl, for example), extract its content and re-display it, possibly modifying it first (e.g. stripping HTML).
The results for the above sample set would need to be:
- Hello. This is one result.
- In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.
- And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.
- And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.
The regex I've tried, %<div class="content"[^>]*>(.+?)</div>%, fails on the second and fourth results, giving:
- In this result, we use <div class="big">more DIV tags
- And this one uses <div class="big">DIV tags
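For reference, here is a minimal script that reproduces the failure. The variable names and the way the four samples are joined into a single string are only assumptions for illustration; the pattern is exactly the one quoted above.

```php
<?php
// Sample data from this post, one result per line (the layout is an assumption).
$html = <<<'SAMPLE'
<div class="content" >Hello. This is one result.</div> <br />
<div class="content">In this result, we use <div class="big">more DIV tags</div> which makes the regex fail.</div> <br />
<div class="content">And in this result, we use <span class="wide">different</span> <b>HTML</b> <i>tags</i> to style the content.</div> <div class="extra">And this result also has extra content.</div><br />
<div class="content" id="test">And this one uses <div class="big">DIV tags</div>, <b>other HTML tags</b> and a <br /> line break in the post.</div> </table> <br />
SAMPLE;

// The lazy (.+?) stops at the first </div> it meets, whether that tag
// closes the outer "content" div or a div nested inside it.
preg_match_all('%<div class="content"[^>]*>(.+?)</div>%', $html, $matches);

// $matches[1] holds the captured group for each of the four matches.
foreach ($matches[1] as $result) {
    echo '- ', $result, PHP_EOL;
}
```

The first and third results come out complete, while the second and fourth are cut off at the inner closing </div>.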
Basically the problem is that I can't be certain what follows the closing </div>, so there is nothing I can reliably check for with the regex. It might be a <br />, it might be another <div> of some class, it might be the closing tag of a table. The entire goal is simply to retrieve all <div class="content"> sections, including any <div>s they may contain, but nothing outside the corresponding <div> nesting.
Thanks
-fm