
HTML content, "grouped by" wrapped tags (for lack of a better term)

Last post 01-29-2008, 5:26 PM by Stevezilla00. 8 replies.
  •  01-23-2008, 4:27 PM 38835

    HTML content, "grouped by" wrapped tags (for lack of a better term)

    Hey all,

    Before I start, I just want to say that below you will find a modifier of "U".  I know a lot of people say this should not be used, so if there is a way that my issue can be resolved where the U modifier won't be needed, great!

     Language: PHP (4)

    Users submit content via a WYSIWYG editor we provide.  This means we allow HTML markup.  I perform cleanup on what they provide (sanity type stuff) and then I must split their content into multiple pages every x number of words (I have this covered).

    Anyways, I am using a regexp to capture all html blocks so that I get the content ("display" words) and its wrapped html markup.

    To elaborate, here is what is desired:

    Original:

     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    Use regexp to break it into:
    [0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p>
    [1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p>
    [2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    Here are two methods I've used for the regexp, each with varying success depending on the content submitted.
    $html_markup:
     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    --------------------------------------------------------------------------------------
    preg_match_all("/<[^\/]?(\w+)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>(.*)<\/\\1>/iusU" ,$html_markup,$arry,PREG_PATTERN_ORDER);
    GIVES:
     <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p> <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
    --------------------------------------------------------------------------------------
    This one is VERY close.  This almost gives me exactly what I want!
    preg_match_all("/<[^\/](.+)>(.*)<\/(.+)>/iusU",$html_markup,$arry,PREG_PATTERN_ORDER);
    GIVES:
    [0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span>
    [1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span>
    [2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span>
    Notice how it is missing the remaining HTML tags?  For example, the </span></strong></p> that should appear after the </span> that is there.
    Any help would be greatly appreciated.
  •  01-23-2008, 5:10 PM 38838 in reply to 38835

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    How do you plan to handle nested tags?  Also, it would seem that your intended matches are just <p>...</p> blocks.  Is that the case, or is it just the example?

    Raw Match Pattern:
    <p[^>]*>.*?</p>
    
    
    PHP Code Example:
    <?php
    // The "s" modifier lets "." match newlines; ".*?" stops at the first </p>.
    preg_match_all('~<p[^>]*>.*?</p>~s', $sourcestring, $matches);
    echo print_r($matches, true);
    ?>
    $matches array:
    (
        [0] => Array
            (
                [0] => <p style='TEXT-ALIGN: center'><strong><span style='TEXT-DECORATION: underline'><span style='FONT-FAMILY: Times New Roman' size='3;'>Strange and Unexplained Happenings</span></span></strong></p>
                [1] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Author: ZeusFluff.</span></strong></p>
                [2] => <p><strong><span style='FONT-FAMILY: Times New Roman' size= '3;'>Date Started: 7/14/06. Date Finished: 7/14/06.</span></strong></p>
            )
    )
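    For readers who want to try the same pattern outside PHP, here is a minimal sketch in Python. It assumes a shortened version of the sample markup; `re.DOTALL` stands in for the PCRE "s" modifier, and the variable names are only illustrative.

```python
import re

# Shortened stand-in for the poster's sample markup (an assumption, not the full input).
html = (
    "<p style='TEXT-ALIGN: center'><strong>Strange and Unexplained Happenings</strong></p> "
    "<p><strong>Author: ZeusFluff.</strong></p> "
    "<p><strong>Date Started: 7/14/06. Date Finished: 7/14/06.</strong></p>"
)

# re.DOTALL lets "." match newlines, like the PCRE "s" modifier.
# ".*?" is lazy, so each match ends at the first </p> after its opening <p ...>.
blocks = re.findall(r"<p[^>]*>.*?</p>", html, flags=re.DOTALL)

for index, block in enumerate(blocks):
    print(index, block)
```

    Note that `<p[^>]*>` would also accept tags such as <pre>; writing `<p\b[^>]*>` is one way to rule those out.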

  •  01-23-2008, 7:16 PM 38845 in reply to 38838

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    Hi,

     Thank you for the reply.  This works almost 100%!  But, yes, I do need to support nested tags.  For instance, if someone has something like this:

    <p style='MARGIN: 0in 1in 0pt; TEXT-ALIGN: justify'>“Ok I say I&#39;m not gonna hurt you and then you go and do something like that,” Noel said pulling the blood stained dagger out of his chest, “it&#39;s just not sane. Ok I&#39;m gonna put him down and maybe we can talk about this.” Noel said calmly. <strong><em>Slowly lessening his grip on Allan&#39;s throat he let him drop to the forest floor. Allan sweat drenched and fearing he might need a new change of under garments soon looked from Noel to the blood stained dagger lying not three feet away.</em></strong> <span style='FONT-WEIGHT: normal; COLOR: red'><em>He grabbed it and lunged a third time, this time though he hit nothing but the tree to the other side of Noel.</em></span></p>

  •  01-23-2008, 8:56 PM 38847 in reply to 38845

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    That matches fine.  I was concerned about <p>...</p> blocks nested inside a <p>...</p> block; if that's not a possibility, then the pattern should work.
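    To illustrate the concern, here is a small Python sketch of what a lazy pattern does when a tag is nested inside another tag of the same name. It uses <div> rather than <p> purely for illustration (an assumption, since cleaned-up HTML should not contain nested paragraphs).

```python
import re

# Same-name nesting: the outer <div> contains an inner <div>.
nested = "<div>outer start <div>inner</div> outer end</div>"

# The lazy ".*?" stops at the FIRST </div>, which belongs to the inner element.
matches = re.findall(r"<div[^>]*>.*?</div>", nested, flags=re.DOTALL)

print(matches)  # the single match is cut off after the inner closing tag
```

    The match ends at the inner closing tag, so " outer end</div>" is left over and the captured block is not balanced.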
  •  01-25-2008, 7:07 PM 38950 in reply to 38847

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    Alright!  Very close here.

    Here is my next challenge and for the life of me, I can't get this next thing to work right.

    <p class="classname" style='somestyle' id='someid'>It felt like we were dancing around, trying to figure each other out, and she was looking to my reaction and decision to tell her <em>something here</em>. I had no idea what I was doing. I felt like I was in a foreign country, and I didn’t know the language. Why the hell did I care that I passed her test? But I did. I most definitely did.</p>

    I'm trying to break this up using regexp so that I get a structure like:

    [0]
         [0] => <p class="classname" style='somestyle' id='someid'>
         [1] => <em>
    [1]
         [0] => It felt like we were dancing around, trying to figure each other out, and she was looking to my reaction and decision to tell her <em>something here</em>. I had no idea what I was doing. I felt like I was in a foreign country, and I didn’t know the language. Why the hell did I care that I passed her test? But I did. I most definitely did.
         [1] => something here
    [2]
         [0] => </p>
         [1] => </em>

    or something like that.
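    One possible way to get a structure like that is sketched below in Python, under stated assumptions: the markup is well formed, and no element contains another element with the same tag name. The `ELEMENT` pattern and the `collect` helper are hypothetical names, and the sample is a shortened version of the paragraph above.

```python
import re

# Hypothetical pattern: (opening tag)(lazy inner content)(closing tag with the same name).
ELEMENT = re.compile(r"(<(\w+)\b[^>]*>)(.*?)(</\2>)", re.DOTALL)


def collect(html, opens, inners, closes):
    """Record each element's three pieces, then recurse into its inner content."""
    for m in ELEMENT.finditer(html):
        opens.append(m.group(1))
        inners.append(m.group(3))
        closes.append(m.group(4))
        collect(m.group(3), opens, inners, closes)


sample = (
    "<p class=\"classname\" style='somestyle' id='someid'>She was looking to my "
    "reaction and decision to tell her <em>something here</em>. I had no idea "
    "what I was doing.</p>"
)

opens, inners, closes = [], [], []
collect(sample, opens, inners, closes)

print(opens)
print(inners)
print(closes)
```

    The three lists line up by index, so `opens[1]`, `inners[1]`, and `closes[1]` describe the <em> element. Same-name nesting would break the backreference approach.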

     

  •  01-25-2008, 9:48 PM 38954 in reply to 38950

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    Given the simplicity of your original sample and my understanding of your goal from the original question, I fear that's far from "very close."

    Please be specific about what exactly you want to accomplish.  Do you want to strip the tags, find some split points, and then reassemble the content with the tags?

    Also, please explain your latest sample matches further.  Thanks.


  •  01-29-2008, 12:46 PM 39043 in reply to 38954

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    Hi ddrudik,

    I apologize for that.  Let me elaborate further on my original post.

    We have a site where we allow users to submit content.  This content is in HTML (we provide a WYSIWYG editor), and many people copy and paste from MS Word.

    So, I first go through the process of cleaning up what they submit so that what I have left is the "cleansed" html (I use tidy and some other logic to clean up the source).  By the time the cleaning process is complete, I'm left with a "workable" set of html.

    I then need to split the content up into multiple pages.  We define a page as having N number of words.  So, this means that I have to be able to split apart the html, loop through the words until I hit N, end that page (by closing up the markup with the appropriate closing tags, etc), then start a new page and make sure that the starting tags are a continuation of the tags that the content was in when I split it.
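    As a rough illustration of the "loop through the words" step, the Python sketch below separates tags from display words so that only the words are counted. The `TOKEN` pattern and the `display_words` helper are hypothetical, and a "word" is assumed to be any run of non-space text containing at least one word character.

```python
import re

# A token is either a whole tag or a run of text without spaces or "<".
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def display_words(html):
    """Return the text tokens that count as words, skipping tags and bare punctuation."""
    return [
        tok for tok in TOKEN.findall(html)
        if not tok.startswith("<") and re.search(r"\w", tok)
    ]


words = display_words(
    "<p>But then things changed. <strong>Big time</strong>. I found myself</p>"
)
print(len(words), words)
```

    The "." that follows </strong> is its own token here, which is why punctuation-only tokens are excluded from the count.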

    Here is an example: http://fanfictiondev.myfandoms.com/index.php/fanfiction/show/317?site_id=13

    The problem is visible even on the first page of that example.  In the paragraph beginning "Excellent. I’m so relieved to hear it wasn’t a date that I was suddenly myself and instead of trying to figure out what he hell was happening, I acted like myself. I kissed her good night. You know…straight girl kissing good night. Not that I ever thought of it that way before, but I guess that’s .....", only the word "Excellent." should be italicized.

    A further example of what I'm needing to do:

    CONTENT:

    <p>What was supposed to happen next? What the hell was I supposed to do next?</p>
    <blockquote>“I’m actually meeting my brother in Brooklyn for dinner today.”</blockquote>
    <p><span style='font-weight:bold;'><em>That is excellent</em></span>. I’m so relieved to hear it wasn’t a date that I was suddenly myself and instead of trying to figure out what the hell was happening.</p>
    <p>But then things changed. <strong>Big time</strong>. I found myself not in the usual “peck and run” mode.</p>
    <p style='color:blue;'>And she did</p>

    RESULTS OF PAGE SPLITTING IF WE USE EVERY 3 WORDS:

    PAGE 1:
    <p>What was supposed</p>

    PAGE 2:
    <p>to happen next?</p>

    PAGE 3:
    <p>What the hell</p>

    PAGE 4:
    <p>was I supposed</p>

    PAGE 5:
    <p>to do next?</p>

    PAGE 6:
    <blockquote>“I’m actually meeting</blockquote>

    PAGE 7:
    <blockquote>my brother in</blockquote>

    PAGE 8:
    <blockquote>Brooklyn for dinner</blockquote>

    PAGE 9:
    <blockquote>today.”</blockquote>
    <p><span style='font-weight:bold;'><em>That is</em></span></p>

    PAGE 10:
    <p><span style='font-weight:bold;'><em>excellent</em></span>. I’m so</p>

    PAGE 11:
    <p>relieved to hear</p>

    PAGE 12:
    <p>it wasn’t a</p>

    PAGE 13:
    <p>date that I</p>

    PAGE 14:
    <p>was suddenly myself</p>

    PAGE 15:
    <p>and instead of</p>

    PAGE 16:
    <p>trying to figure</p>

    PAGE 17:
    <p>out what the</p>

    PAGE 18:
    <p>hell was happening.</p>

    PAGE 19:
    <p>But then things</p>

    PAGE 20:
    <p>changed. <strong>Big time</strong>.</p>

    PAGE 21:
    <p>I found myself</p>

    PAGE 22:
    <p>not in the</p>

    PAGE 23:
    <p>usual “peck and</p>

    PAGE 24:
    <p>run” mode.</p>
    <p style='color:blue;'>And</p>

    PAGE 25:
    <p style='color:blue;'>she did</p>

     

    Hopefully this makes a bit more sense.

    Thank you for your continued help (and patience).

  •  01-29-2008, 1:34 PM 39048 in reply to 39043

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    While I understand your goal, I don't readily see a quick solution to the issue.


  •  01-29-2008, 5:26 PM 39065 in reply to 39043

    Re: HTML content, "grouped by" wrapped tags (for lack of a better term)

    My blog post on automatically generating an HTML summary covers a similar type of problem using JavaScript. The basic idea is that you'll want to loop over your content matching one word or HTML tag at a time while keeping track of word count and open HTML tags. Then, each time you hit the max word threshold and start a new page, you can close open tags and re-open them for the next page.
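    A minimal Python sketch of that loop follows. It is not the code from the blog post; `paginate`, `close_page`, and the `VOID` set are hypothetical names. It assumes well-formed markup, counts any text run containing a word character as one word, and breaks a page just before the next word or opening tag once the limit is reached.

```python
import re

TOKEN = re.compile(r"<[^>]+>|\s+|[^<\s]+")  # a tag, a whitespace run, or a text run
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # never pushed on the stack


def close_page(parts, open_tags):
    """Join a page's tokens and append closers for tags that are still open."""
    closers = "".join("</%s>" % name for name, _ in reversed(open_tags))
    return "".join(parts).rstrip() + closers


def paginate(html, words_per_page):
    pages, parts, open_tags, count = [], [], [], 0
    for tok in TOKEN.findall(html):
        is_tag = tok.startswith("<")
        is_close = tok.startswith("</")
        # Assumption: a "word" is a text run with at least one word character (\w).
        is_word = not is_tag and re.search(r"\w", tok) is not None
        # Page is full: break just before the next word or opening tag.
        if count >= words_per_page and (is_word or (is_tag and not is_close)):
            pages.append(close_page(parts, open_tags))
            parts = [raw for _, raw in open_tags]  # re-open with original attributes
            count = 0
        parts.append(tok)
        if is_word:
            count += 1
        elif is_tag:
            m = TAG_NAME.match(tok)
            name = m.group(1).lower() if m else ""
            if is_close:
                if open_tags and open_tags[-1][0] == name:
                    open_tags.pop()
            elif name and name not in VOID and not tok.endswith("/>"):
                open_tags.append((name, tok))
    if "".join(parts).strip():
        pages.append(close_page(parts, open_tags))
    return pages


# Shortened version of the sample content from earlier in the thread.
sample = (
    "<p>But then things changed. <strong>Big time</strong>. I found myself</p>\n"
    "<p style='color:blue;'>And she did</p>"
)
pages = paginate(sample, 3)
for number, page in enumerate(pages, start=1):
    print("PAGE %d:" % number, page)
```

    Closing tags and trailing punctuation stay on the page they belong to, because the break only happens when the next word or opening tag arrives. Keeping the raw opening tag on the stack is what lets attributes such as style='color:blue;' carry over to the next page.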
