
ignore anything between <!-- -->

Last post 05-04-2006, 9:56 AM by Sergei Z. 3 replies.
  •  05-03-2006, 10:19 PM 17319

    ignore anything between <!-- -->

    hi everyone, thanks for being here for regex help. i am running a regex through a big XML database (a wikipedia dump), and when it scans each article i want it to ignore anything placed between <!-- and -->, which is wikipedia's comment syntax. my regex finds misspelled words, so if there are misspelled words inside a comment, i want it to skip them.

    for instance, this is a result i get back when running my regex:
      <!--NOTE: Copy everything below here. Remember to remove unneccessary sections-->
    notice how "unneccessary" has two "c"s, which makes it misspelled. can a regex be told not to look inside comments like that (thus not returning 'false positives' like this)?

    here is a URL with the regex:
    http://en.wikipedia.org/wiki/User:JoeSmack/regex

    thanks much
    Joe
  •  05-04-2006, 9:49 AM 17335 in reply to 17319

    Re: ignore anything between <!-- -->

    pattern

    (?<!<!--[^!>]*)\bunneccessary(?![^>]*-->)

    will match only the 2nd occurrence of *unneccessary* (the one outside the comment) in the input:

     <!--NOTE: Copy everything below here. Remember to remove unneccessary sections-->
    NOTE: Copy everything below here. Remember to remove unneccessary sections
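
    The pattern above relies on a variable-length look-behind, which .NET's regex engine accepts but Python's built-in `re` module rejects. As a rough sketch of the same behaviour in Python, the snippet below swaps in a different technique: an alternation that consumes whole comments first and only captures the word when it appears outside them. The names `SKIP_COMMENTS` and `find_outside_comments` are made up for illustration and are not part of the thread.

```python
import re

# Hypothetical stand-in for the look-behind pattern: the left branch swallows
# an entire <!-- ... --> comment, the right branch captures the misspelling.
SKIP_COMMENTS = re.compile(r"<!--.*?-->|\b(unneccessary)\b", re.DOTALL)

def find_outside_comments(text):
    """Return start offsets of the misspelling that lie outside comments."""
    # group(1) is None whenever the comment branch was the one that matched.
    return [m.start(1) for m in SKIP_COMMENTS.finditer(text)
            if m.group(1) is not None]

sample = (
    "<!--NOTE: Copy everything below here. Remember to remove unneccessary sections-->\n"
    "NOTE: Copy everything below here. Remember to remove unneccessary sections"
)

hits = find_outside_comments(sample)
print(len(hits))  # → 1 (only the occurrence on the second line)
```

    Because the comment branch matches up to the first `-->`, this version does not care whether `<` or `>` appears inside the comment text.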

  •  05-04-2006, 9:50 AM 17336 in reply to 17335

    Re: ignore anything between <!-- -->

    correction: make it

    (?<!<!--[^!<>]*)\bunneccessary(?![^<>]*-->)

  •  05-04-2006, 9:56 AM 17337 in reply to 17336

    Re: ignore anything between <!-- -->

    or even

    (?<!<!--[^<>]*)\bunneccessary(?![^<>]*-->)

    because dropping ! from the character class lets the negative look-behind still see the opening <!-- when the comment text itself contains an exclamation mark. You should be aware of the limitation: the pattern will not work if *<* or *>* appears INSIDE a comment; in that case it may give you unexpected output.
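
    One possible way around the *<* / *>* limitation, if a pre-processing step is acceptable, is to blank out every comment before running the spell-check regex, so the existing pattern needs no look-arounds at all. The sketch below assumes Python and that comments are not nested; `blank_comments` and `search_outside_comments` are hypothetical helper names, not something from the original regex page.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def blank_comments(text):
    """Overwrite each comment with spaces, keeping newlines, so that
    character offsets and line numbers in the text stay unchanged."""
    return COMMENT.sub(lambda m: re.sub(r"[^\n]", " ", m.group(0)), text)

def search_outside_comments(pattern, text):
    """Run any regex against the text with comments blanked out."""
    return [(m.start(), m.group(0))
            for m in re.finditer(pattern, blank_comments(text))]

# The comment here contains < and >, the case the look-around pattern cannot handle.
article = "<!-- use <br> here, unneccessary -->\nAn unneccessary edit."
results = search_outside_comments(r"\bunneccessary\b", article)
print(results)
```

    Since the blanked text has the same length as the original, the offsets in `results` can be used directly against the untouched article.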
