
changing HTML structure, without affecting content

Last post 09-22-2008, 2:43 PM by unholycow. 13 replies.
  •  09-03-2008, 12:41 PM 45896

    changing HTML structure, without affecting content

    Hello,

    I have a very large XML file, containing some XHTML. In this file, there are hundreds of <p>s which I want to change to <table>s. All of the <p>s I want changed have a specific ID, so it seems like it should be easy to replace them, but I can only get 36 of them. I am using HomeSite 5.2.

    Find what:

    <p id="21">Note:([A-Za-z0-9_ ><.="']*)</p>

    Replace with:

    <table rows="2" cols="2" width="8744"><tr rowheight="0"><td width="630" bordercolors="65793"><p id="4"><fref id="13453"/></p></td><td width="8114" bordercolors="65793"><p id="4">\1</p></td></td></table>

    If I search for <p id="21">Note:, I get 186 hits. If I use the above replacement, it works great, but for only 36 hits. Some (but not all) of the items not found have additional coding in them, but I would think that the >< in the Find what string would take care of them. Unfortunately, I barely have a clue what I'm doing, and I've taken the [A-Za-z0-9_ ><.="']* from somebody else's example without understanding it. I'm not sure I've even titled this request correctly. Thanks so much for any suggestions.

  •  09-03-2008, 2:27 PM 45899 in reply to 45896

    Re: changing HTML structure, without affecting content

    We can't see your file; please post it to a site such as pastebin.com if you want us to test with it. In the meantime, here is what the character class in your pattern means:

      [A-Za-z0-9_ ><.="']*     any character of: 'A' to 'Z', 'a' to 'z',
                               '0' to '9', '_', ' ', '>', '<', '.', '=',
                               '"', "'" (0 or more times, matching as
                               many as possible)
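
As a rough illustration of why that character class finds only some of the hits, here is a minimal Python sketch. The sample strings are made up for the demonstration (they are not taken from the poster's file), and Python's `re` module stands in for HomeSite's engine, which may behave differently. Any character outside the listed set, such as the `/` in a closing tag or a comma, stops the match before `</p>` is reached.

```python
import re

# The original "Find what" pattern from the first post.
pattern = re.compile(r'''<p id="21">Note:([A-Za-z0-9_ ><.="']*)</p>''')

# Hypothetical sample lines, invented for this sketch.
samples = [
    '<p id="21">Note: plain text only.</p>',       # only characters from the class
    '<p id="21">Note: see <b>this</b> item.</p>',  # "/" in </b> is not in the class
    '<p id="21">Note: first, second.</p>',         # "," is not in the class
]

results = [bool(pattern.search(s)) for s in samples]
print(results)
```

Only the first sample matches, which is consistent with a subset of the paragraphs being found.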


  •  09-03-2008, 4:37 PM 45908 in reply to 45899

    Re: changing HTML structure, without affecting content

    Thanks for the quick reply. To avoid sharing this huge amount of data, I've pasted a representative sample into pastebin here: http://pastebin.com/m7ad9ecee
  •  09-03-2008, 4:45 PM 45909 in reply to 45896

    Re: changing HTML structure, without affecting content

    If you are working with XML then XSLT might be a better option than a regex.

    Also, your table markup has a typo: there is no closing </tr>, but there is an extra closing </td>.
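
For comparison, here is a minimal sketch of the parser-based idea, written in Python with `xml.etree.ElementTree` rather than XSLT. It assumes the elements carry no XML namespace and that the file is well-formed; the function names `note_to_table` and `convert_notes` and the sample document are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

NOTE_PREFIX = "Note:"

def note_to_table(p):
    """Build the replacement <table> from a matching <p>, keeping its content."""
    table = ET.Element("table", {"rows": "2", "cols": "2", "width": "8744"})
    tr = ET.SubElement(table, "tr", {"rowheight": "0"})
    left = ET.SubElement(tr, "td", {"width": "630", "bordercolors": "65793"})
    ET.SubElement(ET.SubElement(left, "p", {"id": "4"}), "fref", {"id": "13453"})
    right = ET.SubElement(tr, "td", {"width": "8114", "bordercolors": "65793"})
    body = ET.SubElement(right, "p", {"id": "4"})
    body.text = p.text[len(NOTE_PREFIX):]  # text that followed "Note:"
    body.extend(list(p))                   # nested markup such as <b>...</b>
    table.tail = p.tail                    # keep whatever followed the </p>
    return table

def convert_notes(root):
    """Replace every <p id="21">Note:...</p> below root; return the count."""
    count = 0
    for parent in list(root.iter()):       # snapshot, since the tree is edited
        for index, child in enumerate(list(parent)):
            text = child.text or ""
            if child.tag == "p" and child.get("id") == "21" and text.startswith(NOTE_PREFIX):
                parent[index] = note_to_table(child)
                count += 1
    return count

# Hypothetical sample document, invented for this sketch.
root = ET.fromstring(
    '<doc><p id="21">Note: see <b>this</b> now</p><p id="5">Other</p></doc>'
)
converted_count = convert_notes(root)
output = ET.tostring(root, encoding="unicode")
print(converted_count)
print(output)
```

Because the parser works on elements rather than characters, nested tags inside the note are carried over without any special handling.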


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  09-03-2008, 4:56 PM 45912 in reply to 45908

    Re: changing HTML structure, without affecting content

    mash made a good point about not using regex when another method designed specifically for XML is available, but this regex match pattern worked for me on your source text:

    <p id="21">.*?</p>

    http://livedocs.adobe.com/coldfusion/5.0/Using_HomeSite/language5.htm


  •  09-04-2008, 1:58 PM 45972 in reply to 45912

    Re: changing HTML structure, without affecting content

    Thanks for the tip. I guess I will have to learn XSLT. I was hoping there'd be an easier option, since I felt like I was close with regex.

    The match pattern you suggested seems so elegant, just what I was hoping for, but in HomeSite it generated an error 17, and I can't find any searchable reference online that describes errors by number. Then I realized that it wouldn't do what I need anyway, because I need to match a pattern at the beginning of some arbitrary content and the pattern at the end (</p>), while carrying the content over into the replacement, so I need a group in the search pattern. I tried this:

    <p id="21">Note:(.*?)</p>

    and replaced with the same as before (after fixing the </td></td> error, thanks mash). But it found the first instance of <p id="21">Note:, added the rest of the document to the group, and then replaced the last </p> in the document, skipping over the other instances of <p id="21">Note: content</p>.
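
The behavior described above looks like what a greedy match produces when the dot is allowed to cross line breaks. Whether HomeSite ignores or misreads the lazy `*?` quantifier is an assumption; nothing in this thread confirms it. The Python sketch below only contrasts the two behaviors on a made-up three-line document.

```python
import re

# Hypothetical document with two notes and an unrelated paragraph between them.
doc = (
    '<p id="21">Note: one</p>\n'
    '<p id="7">x</p>\n'
    '<p id="21">Note: two</p>'
)

# Greedy: .* runs to the LAST </p>, swallowing everything in between.
greedy = re.findall(r'<p id="21">Note:(.*)</p>', doc, flags=re.DOTALL)

# Lazy: .*? stops at the FIRST </p> after each opening tag.
lazy = re.findall(r'<p id="21">Note:(.*?)</p>', doc, flags=re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

With the greedy form there is a single match whose group contains the intervening markup; with the lazy form each note is matched separately.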

  •  09-04-2008, 3:19 PM 45983 in reply to 45972

    Re: changing HTML structure, without affecting content

    Your next attempt might be:

    <p id="21">Note:(?:(?!</p>).)*</p>


  •  09-04-2008, 3:45 PM 45986 in reply to 45983

    Re: changing HTML structure, without affecting content

    I appreciate your continued suggestions. That last one generated no hits, whether doing the Find and Replace or a Find. Faced with the (for me) daunting task of learning XSLT, I messed with it a bit to see what would happen.

    If I try this: <p id="21">Note:(?:(?!</p>).)*

    I get four hits for "<p id="21">Note:" when Finding, but doing my find and replace generated "Regular expression error 0".

    I guess I'm just flailing at this point.

  •  09-04-2008, 3:48 PM 45988 in reply to 45986

    Re: changing HTML structure, without affecting content

    Something in the application appears to be interfering with the operation. You can see the pattern working against your supplied text here:

    http://www.myregextester.com/?r=306


  •  09-08-2008, 2:10 PM 46123 in reply to 45988

    Re: changing HTML structure, without affecting content

    What application would you recommend for doing a find and replace of this nature on a very large xml file (or set of xml files)?
  •  09-08-2008, 6:07 PM 46131 in reply to 46123

    Re: changing HTML structure, without affecting content

    Not necessarily XML-specific apps, but here are two options I have seen:

    http://tools.tortoisesvn.net/grepWin

    http://www.powergrep.com/

    PowerGREP has a larger feature set, but it is a commercial application.

    I don't have experience with XML-specific find/replace apps so that's the best I can recommend at this point.


  •  09-09-2008, 5:06 PM 46168 in reply to 46131

    Re: changing HTML structure, without affecting content

    grepWin is really easy to use. Thanks for the link. When I tried this:

     Find: <p id="21">Note:(?:(?!</p>).)*</p>

     Replace with: <table rows="2" cols="2" width="8744"><tr rowheight="0"><td width="630" bordercolors="65793"><p id="4"><fref id="13453"/></p></td><td width="8114" bordercolors="65793"><p id="4">\1</p></td></tr></table>

    it found every instance, but killed the content. The \1 was gone, and there was nothing there. But if I do this:

     Find: <p id="21">Note:([A-Za-z0-9_ ><.="']*)</p>

     Replace with: <table rows="2" cols="2" width="8744"><tr rowheight="0"><td width="630" bordercolors="65793"><p id="4"><fref id="13453"/></p></td><td width="8114" bordercolors="65793"><p id="4">\1</p></td></tr></table>

    That works fine, but only for 36 of the 187 instances. I feel like I'm closer, but the regex is not quite right. Meanwhile, I've been reading up on doing find-replace with XSLT, and XSLT uses regex for this. So it appears that regex is my best hope after all.

    Thanks all for bearing with me.

  •  09-09-2008, 5:12 PM 46169 in reply to 46168

    Re: changing HTML structure, without affecting content

    Find: <p id="21">Note:((?:(?!</p>).)*)</p>

    Your first pattern was missing capture group 1.
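
To make the difference concrete, here is a small Python sketch. It uses Python's `re` module as a stand-in for grepWin's engine, so details may differ, and the sample text is invented. The `(?:...)` form is non-capturing, so the pattern without the extra parentheses defines no group for `\1` to refer to.

```python
import re

find_without_group = r'<p id="21">Note:(?:(?!</p>).)*</p>'
find_with_group = r'<p id="21">Note:((?:(?!</p>).)*)</p>'

# Number of capture groups each pattern defines.
groups_without = re.compile(find_without_group).groups
groups_with = re.compile(find_with_group).groups
print(groups_without, groups_with)

# The corrected replacement text from the thread; \1 is capture group 1.
replacement = (
    r'<table rows="2" cols="2" width="8744"><tr rowheight="0">'
    r'<td width="630" bordercolors="65793"><p id="4"><fref id="13453"/></p></td>'
    r'<td width="8114" bordercolors="65793"><p id="4">\1</p></td></tr></table>'
)

# Hypothetical sample text, invented for this sketch.
text = '<p id="21">Note: see <b>this</b> item.</p>'
converted = re.sub(find_with_group, replacement, text, flags=re.DOTALL)
print(converted)
```

The `(?!</p>).` part consumes one character at a time only while the next characters are not `</p>`, so each match ends at the nearest closing tag even when the content contains other markup.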


  •  09-22-2008, 2:43 PM 46576 in reply to 46169

    Re: changing HTML structure, without affecting content

    Cool, thank you so much, this is now working in grepWin.