
Find / Replace text within known tags

Last post 11-30-2008, 5:30 PM by Aussie Susan. 4 replies.
  •  11-27-2008, 2:40 PM 48879

    Find / Replace text within known tags

    Hi,

    I am completely new here and completely new to RegEx. I have been trying to work this out on my own but I need help.

    I have hundreds of HTML files to edit and I have been given a computer with PowerGrep. I hope to learn a lot as I go along.

    I have managed to do a number of find and replaces that were very simple but now I have a more difficult problem and I need a little help.

    The following is an example of some of the text within the file:

    <!--[if gte mso 9]>
    <xml>
     <o:DocumentProperties>
      <o:Template>Normal</o:Template>
      <o:Created>2006-09-25T14:05:00Z</o:Created>
      <o:LastSaved>2006-09-29T17:33:00Z</o:LastSaved>
      <o:Pages>1</o:Pages>
      <o:Words>347</o:Words>
      <o:Characters>1842</o:Characters>
      <o:Lines>15</o:Lines>
      <o:Paragraphs>4</o:Paragraphs>
      <o:CharactersWithSpaces>2185</o:CharactersWithSpaces>
      <o:Version>11.5606</o:Version>
     </o:DocumentProperties>
    </xml>
    <![endif]-->

    What I would like to do is delete everything that you see above. The only constants across the files are that they begin with <!--[if gte mso 9]> and end with <![endif]-->

    Sometimes the <!--[if gte mso 9]> might begin again, so I only want to match from <!--[if gte mso 9]> up to the first instance of <![endif]-->

    I anticipate learning a great deal in the next few weeks. Your guidance and help will be really appreciated.

  •  11-27-2008, 7:49 PM 48891 in reply to 48879

    Re: Find / Replace text within known tags

    Welcome to the forum.

    I've not used it myself, but according to the documentation PowerGrep understands Perl- and .NET-style regex patterns, so the following should at least get you started. Try the pattern:

    <!--\[if\x20gte\x20mso\x209\]>((?!<!\[endif\]-->).)*<!\[endif\]-->

    with the 'singleline' option (also called "dot matches newline" in some contexts) set, and possibly with the 'ignore case' option set as well. The replacement string should be blank.
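As a minimal sketch of how this pattern behaves outside PowerGrep, the snippet below runs it through Python's `re` module. This is an assumption for illustration only: the thread is about PowerGrep, whose options are set in its own interface, and the `sample` text is a hypothetical, shortened version of the excerpt in the question. `re.DOTALL` plays the role of the 'singleline' option and `re.IGNORECASE` the 'ignore case' option.

```python
import re

# Susan's pattern, copied verbatim from the post above.
PATTERN = re.compile(
    r"<!--\[if\x20gte\x20mso\x209\]>((?!<!\[endif\]-->).)*<!\[endif\]-->",
    re.DOTALL | re.IGNORECASE,  # 'singleline' and 'ignore case'
)

# Hypothetical sample, shortened from the excerpt in the question.
sample = (
    "<p>before</p>\n"
    "<!--[if gte mso 9]>\n"
    "<xml>\n"
    " <o:Pages>1</o:Pages>\n"
    "</xml>\n"
    "<![endif]-->\n"
    "<p>after</p>\n"
)

# A blank replacement string deletes every match.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Only the matched block is removed; the newline that followed `<![endif]-->` stays behind, so a blank line remains where the block used to be.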

    I'm not sure what you mean by "Sometimes the .... might begin again". If the "if"s can be nested, then the above won't work. If you mean that the 'if-endif' sequence can appear multiple times in the same document, then the pattern will work, but you will need to do whatever is required within PowerGrep to tell it to match multiple times.
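The two readings of "might begin again" can be compared with a small sketch, again using Python's `re` as a stand-in engine (an assumption; PowerGrep's own behaviour was not tested here). Both input strings are hypothetical. `re.sub` replaces every match by default, which corresponds to telling the tool to match multiple times.

```python
import re

PATTERN = re.compile(
    r"<!--\[if\x20gte\x20mso\x209\]>((?!<!\[endif\]-->).)*<!\[endif\]-->",
    re.DOTALL,
)

# Case 1: the if-endif sequence appears twice, one after the other.
two_blocks = (
    "A<!--[if gte mso 9]>one<![endif]-->"
    "B<!--[if gte mso 9]>two<![endif]-->C"
)

# Case 2: a second "if" opens before the first one has closed (nesting).
nested = (
    "A<!--[if gte mso 9]>outer"
    "<!--[if gte mso 9]>inner<![endif]-->"
    "tail<![endif]-->B"
)

repeated_result = PATTERN.sub("", two_blocks)
nested_result = PATTERN.sub("", nested)

print(repeated_result)  # both blocks removed
print(nested_result)    # a stray closing marker is left behind
```

In the nested case the match runs from the first opener to the first `<![endif]-->`, so the outer block's tail and its closing marker survive. That is the sense in which the pattern does not handle nesting.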

    The way this works is to first match the beginning phrase (<!--\[if\x20gte\x20mso\x209\]>) where the spaces have been replaced with '\x20'. In some regexes you can use literal space characters, but using '\x20' to represent a space makes the character and the count explicit (e.g. is "   " 1, 2, 3 or 4 spaces?). You can also use '\w' to represent a whitespace character, but this would also include tabs and newline characters. In this case that may be OK, but only you can tell based on what is in the files.

    The next part steps through the following text one character at a time and stops when it sees the ending phrase. The "(?!<!\[endif\]-->)" part is a negative lookahead, which is interpreted as "from here, look forward and see if the next characters match this sub-pattern. If they do not (hence the 'negative' part of the name) then carry on; otherwise stop processing this group and move on". This is embedded in the group "(    .)*", which checks the lookahead and then steps forward one character before repeating the whole group test again. In this way it will match everything until it gets to the first occurrence of the ending character sequence, where it will stop.
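The "check the lookahead, then step one character" loop can be spelled out by hand. The sketch below is illustrative only: `scan_until` is a hypothetical helper written for this note, not part of any library, and Python's `re` is assumed as the engine. It shows that the manual loop and the lookahead group consume the same text.

```python
import re

END_LITERAL = "<![endif]-->"
END_REGEX = r"<!\[endif\]-->"

def scan_until(text: str, stop: str) -> str:
    """Hypothetical helper: advance one character at a time until `stop` begins."""
    i = 0
    while i < len(text) and not text.startswith(stop, i):
        i += 1  # lookahead failed to find `stop` here, so consume one character
    return text[:i]

# The middle part of Susan's pattern on its own (non-capturing for clarity).
tempered = re.compile(r"(?:(?!" + END_REGEX + r").)*", re.DOTALL)

text = "line one\nline two<![endif]-->more<![endif]-->"

by_hand = scan_until(text, END_LITERAL)
by_regex = tempered.match(text).group(0)

print(repr(by_hand))
print(repr(by_regex))
```

Both stop immediately before the first `<![endif]-->`, even though a second one appears later in the text.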

    The last part simply matches the end character sequence.

    In some patterns you will see the whole lookahead part replaced with ".*?" which is a non-greedy match of any character (but see the next comment) and relies on the part of the pattern that follows to limit the match. There have been wars fought over less, but I prefer the technique I've shown because it is extendable with arbitrarily complex subpatterns and is very explicit as to when to stop.
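For comparison, here is a sketch of the two styles side by side, assuming Python's `re` and a hypothetical sample string. The variable names are made up for this note. On this input the lookahead form and the non-greedy `.*?` form find exactly the same matches.

```python
import re

OPEN = r"<!--\[if\x20gte\x20mso\x209\]>"
CLOSE = r"<!\[endif\]-->"

# Lookahead style, as in the pattern above (group made non-capturing).
lookahead_style = re.compile(OPEN + r"(?:(?!" + CLOSE + r").)*" + CLOSE, re.DOTALL)
# Non-greedy style: '.*?' expands only as far as needed for CLOSE to match.
lazy_style = re.compile(OPEN + r".*?" + CLOSE, re.DOTALL)

sample = (
    "x<!--[if gte mso 9]>a\nb<![endif]-->"
    "y<!--[if gte mso 9]>c<![endif]-->z"
)

lookahead_matches = lookahead_style.findall(sample)
lazy_matches = lazy_style.findall(sample)

print(lookahead_matches == lazy_matches)
print(lazy_style.sub("", sample))
```

The lookahead version states the stopping condition inside the repeated group itself, which is what makes it easy to extend with further conditions.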

    I've used the '.' (dot) regex operator, which matches all characters except the newline character; this behaviour is controlled by the (badly named) 'singleline' option. A better name might be 'dot matches newline'. Basically, if the option is not set, then the dot does not match a newline character and so will stop at the end of each line of text. If the option is set, then the newline character is matched and so the match can span multiple lines if necessary.
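A short sketch of the option's effect, assuming Python's `re`, where the equivalent flag is called `re.DOTALL`. The sample text is made up.

```python
import re

text = "start\nmiddle\nend"

# Without the option, '.' refuses to cross the newline.
without_option = re.match(r".*", text).group(0)
# With the option, '.' matches the newline too, so the match spans all lines.
with_option = re.match(r".*", text, re.DOTALL).group(0)

print(repr(without_option))  # 'start'
print(repr(with_option))     # 'start\nmiddle\nend'
```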

    The pattern I've presented is the 'raw' pattern which is what is expected to be used by the regex engine. I don't know if this is acceptable to PowerGrep so you may need to add other things to it such as escape characters and delimiters. However, as PowerGrep is a stand-alone program, I would expect that it would accept the pattern just as I've shown it above.

    You have not shown us what else surrounds the target text and so there may be other things in the text that will trip this pattern up.

    Hope this helps

    Susan

  •  11-28-2008, 1:35 PM 48924 in reply to 48891

    Re: Find / Replace text within known tags

    Aussie Susan:

    The way this works is to first match the beginning phrase (<!--\[if\x20gte\x20mso\x209\]>) where the spaces have been replaced with '\x20'. In some regexes you can use literal space characters, but using '\x20' to represent a space makes the character and the count explicit (e.g. is "   " 1, 2, 3 or 4 spaces?). You can also use '\w' to represent a whitespace character, but this would also include tabs and newline characters. In this case that may be OK, but only you can tell based on what is in the files.

    Susan, you have a typo there. I think you meant to type \s in your explanation, not \w.
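The difference between the three notations can be checked directly. This is a sketch assuming Python's `re`, whose `\s` and `\w` classes follow the usual Perl-style meaning; the probe characters are chosen just for illustration.

```python
import re

# Characters to probe (chosen for illustration).
samples = {"space": " ", "tab": "\t", "newline": "\n", "letter": "a"}
# The three notations discussed in the thread.
patterns = [r"\x20", r"\s", r"\w"]

# For each notation, list which probe characters it matches.
matches = {
    pattern: [name for name, ch in samples.items() if re.fullmatch(pattern, ch)]
    for pattern in patterns
}

for pattern, names in matches.items():
    print(pattern, "matches:", names)
```

`\x20` matches only the space, `\s` also accepts the tab and the newline, and `\w` matches word characters such as letters and no whitespace at all.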

     


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  11-28-2008, 1:48 PM 48925 in reply to 48924

    Re: Find / Replace text within known tags

    Since the only part of the offending text that is actually HTML is the comments containing the "if", you could just remove everything between the xml tags:

    <(xml)[\s\S]+?</\1>

    Susan's suggestion will do what was asked, but I must admit I think some details are being left out of this question, which is why I'm ignoring the HTML comments; HTML-wise, they are the only valid part.
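A sketch of what this alternative removes, assuming Python's `re` and a hypothetical, shortened sample. Here `[\s\S]` matches any character including newlines, so no 'singleline' option is needed, and `\1` refers back to the captured tag name `xml`.

```python
import re

# Michael's pattern, copied verbatim from the post above.
XML_BLOCK = re.compile(r"<(xml)[\s\S]+?</\1>")

# Hypothetical sample, shortened from the excerpt in the question.
sample = (
    "<!--[if gte mso 9]>\n"
    "<xml>\n"
    " <o:Pages>1</o:Pages>\n"
    "</xml>\n"
    "<![endif]-->"
)

result = XML_BLOCK.sub("", sample)
print(result)
```

The `<xml>` element is removed, but the surrounding conditional comment markers remain in the file as an empty shell.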


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein
  •  11-30-2008, 5:30 PM 49001 in reply to 48924

    Re: Find / Replace text within known tags

    Michael,

    Thanks for the correction. It's funny how you can try to proofread your own work, but I find I always read what I intended to write rather than what is actually there.

    Susan
