Match words NOT in html tags

Last post 11-24-2008, 9:30 AM by ddrudik. 9 replies.
  •  11-21-2008, 11:23 AM 48631

    Match words NOT in html tags

    Hi everyone!

    I have a regex problem I could use your help with:

    I am using PHP to preg_replace words within a block of text with hyperlinks, but when the word is already inside a hyperlink, it is still turned into a link. Basically, the preg_replace regex I am using is "double linking" the word.

    In order to solve this I need some help in coming up with a regex that will match only words that are NOT in any html tags.

    The preg_replace I am using now is a simple word match: preg_replace("/\bword\b/","<a href=\"http://www.example.com\">Word</a>",$text);

    Thanks in advance for any help!

    Craig
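
A minimal sketch of the "double linking" problem described above, written with Python's re module rather than PHP's preg_replace (the pattern itself is the same in both). The sample text is made up for illustration:

```python
import re

# Hypothetical sample text: one plain "word" and one already inside a link.
text = 'A word here, and a linked <a href="http://www.example.com">word</a> there.'
link = '<a href="http://www.example.com">Word</a>'

# Python counterpart of the simple word-match replacement from the post.
result = re.sub(r"\bword\b", link, text)

# The second "word" was already link text, so it ends up in nested <a> tags.
print(result)
```

The plain word-boundary match has no notion of where the word sits in the markup, which is why the already-linked word gets wrapped a second time.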

  •  11-22-2008, 4:20 AM 48667 in reply to 48631

    Re: Match words NOT in html tags

    Please read the forum posting guidelines:

    http://regexadvice.com/forums/thread/47451.aspx

    Your question is, IMO, not clear.

  •  11-22-2008, 11:24 AM 48676 in reply to 48631

    Re: Match words NOT in html tags

    This is a common question; consider this example:

    http://regexadvice.com/forums/thread/45345.aspx


  •  11-24-2008, 5:18 AM 48752 in reply to 48676

    Re: Match words NOT in html tags

    Hi

    Thanks for the replies.

    I have come up with the following (only a slight change to your previously posted regex :)

    • (?s)\bFIND ME\b(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=((?!<body).)*[^?!(</body>|</a>)])

    This seems to be working fine at finding all words within an HTML page that are NOT contained within tags.

    My next question is this: can you see anything wrong with the regex posted above? I have tested it using a few different scenarios and it seems to be fine, but I was wondering whether any of you regex gurus can see something I could be overlooking.

    Thanks in advance,

    Craig
     

     

  •  11-24-2008, 5:36 AM 48753 in reply to 48752

    Re: Match words NOT in html tags

    This part: [^?!(</body>|</a>)] does not do what you think it does.

    Everything between square brackets (called a character class) will match a single character, and almost all of the "normal" meta characters have no special meaning inside character classes. So, [^?!(</body>|</a>)] will match any single character except one of the following: ?, !, (, <, /, b, o, d, y, >, |, a, )

    See: http://www.regular-expressions.info/charclass.html
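
The point about character classes can be checked directly. This is a small sketch in Python's re module (standing in for PHP's PCRE functions; character classes behave the same way here), with a made-up test string:

```python
import re

# The character class from the pattern under discussion.
char_class = re.compile(r"[^?!(</body>|</a>)]")

# The distinct characters the class refuses to match.
excluded = sorted(set("?!(</body>|</a>)"))

# Each match is one character; "</body>" is never treated as a unit.
singles = char_class.findall("</body> ok")

print(excluded)
print(singles)
```

In the test string only the space and the "k" are matched: every character of "</body>" is excluded individually, and so is the "o" of "ok".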

  •  11-24-2008, 5:51 AM 48754 in reply to 48752

    Re: Match words NOT in html tags

    Wistow:

    I have come up with the following (only a slight change to your previously posted regex :)

    • (?s)\bFIND ME\b(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=((?!<body).)*[^?!(</body>|</a>)])

    As noted, that is a problematic change.

    From the previous thread, this example might be of more help to your goal here:

    (?si)\bFIND ME\b(?![^<>]*</a>)(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)

    See it in action here:

    http://www.myregextester.com/?r=02f3271f
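
For readers without access to the tester link, here is a rough sketch of how the pattern above behaves, using Python's re module instead of PHP (the pattern uses only syntax both engines share). The sample page is invented for illustration:

```python
import re

pattern = re.compile(
    r"(?si)\bFIND ME\b"
    r"(?![^<>]*</a>)"                             # not followed by text and then </a>
    r"(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)"        # not inside a tag
    r"(?=(?:(?!<body).)*</body>)"                 # inside <body>...</body>
)

# Hypothetical sample page with five occurrences of the phrase.
html = (
    '<html><head><title>FIND ME</title></head>'
    '<body><p>FIND ME here</p>'
    '<a href="x.html" title="FIND ME">FIND ME</a>'
    '<p>and find me again</p></body></html>'
)

# group(0) is the whole match; the capturing group is only used in the lookahead.
hits = [m.group(0) for m in pattern.finditer(html)]
print(hits)
```

Only the two occurrences in the paragraphs are matched. The one in the title fails the body check, the one in the title attribute fails the inside-a-tag check, and the link text fails the </a> check.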


  •  11-24-2008, 6:03 AM 48755 in reply to 48753

    Re: Match words NOT in html tags

    Thanks for the reply

    If I were to change:

    • (?s)\bFIND ME\b(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=((?!<body).)*[^?!(</body>|</a>)])

    to:

    • (?s)\bFIND ME\b(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=((?!<body).)*[^<])

    Would it essentially be doing the same thing? And how would I then add to the regex to find a string of characters, e.g. </a>, rather than a single character in the character class?

    Thanks in advance,

    Craig
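
On the two questions above: the classes are not equivalent, since [^<] excludes only "<" while the longer class excludes thirteen different characters. To exclude a string rather than a single character, a negative lookahead such as (?!</a>) is the usual tool, in the same style as the (?:(?!<body).)* construct in ddrudik's pattern. A minimal sketch in Python's re module (not PHP), with a made-up sample string:

```python
import re

sample = "read a/b</a> rest"

# A character class only ever excludes single characters: <, /, a and >.
char_class = re.compile(r"[^</a>]*")
stops_early = char_class.match(sample).group(0)

# A negative lookahead tested before every character excludes the string "</a>".
tempered = re.compile(r"(?s)(?:(?!</a>).)*")
up_to_close = tempered.match(sample).group(0)

print(repr(stops_early))
print(repr(up_to_close))
```

The character class stops at the first "a" it meets, while the lookahead version runs up to the literal "</a>".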
     

  •  11-24-2008, 6:18 AM 48759 in reply to 48754

    Re: Match words NOT in html tags

    ddrudik:
    (?si)\bFIND ME\b(?![^<>]*</a>)(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)

    ddrudik, that seems to be exactly what I was looking for, and I now have more understanding of where I was going wrong with the regex I posted. (I posted my last reply unaware that you had replied, that's why it probably makes no sense :)

    Thanks for all the help!

  •  11-24-2008, 6:45 AM 48760 in reply to 48759

    Re: Match words NOT in html tags

    ddrudik:
    (?si)\bFIND ME\b(?![^<>]*</a>)(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)

    With the above regex in mind, if I were to add |</span> after *</a>, so the full regex was:

    • (?si)\bFIND ME\b(?![^<>]*</a>|</span>)(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)

    would this also mean the regex would exclude any text found in <span></span> tags? I believe it would, but I'm after an expert opinion :)

    Thanks in advance!

    Craig
     

  •  11-24-2008, 9:30 AM 48767 in reply to 48760

    Re: Match words NOT in html tags

    It's best to test how a pattern behaves against a given source. See your new pattern in action with source that contains a span:

    http://www.myregextester.com/?r=c77b2c05

    Also acceptable for your new task:

    (?si)\bFIND ME\b(?![^<>]*</(?:a|span)>)(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)
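
The difference between the two patterns comes down to alternation precedence: in (?![^<>]*</a>|</span>) the "|" splits the whole lookahead, so the second branch only rejects a </span> that follows immediately. A small comparison sketch, in Python's re module rather than PHP and with an invented sample:

```python
import re

tail = r"(?=([^<>]*<[^>]+>[^<>]*)+$|[^<>]*$)(?=(?:(?!<body).)*</body>)"

# Alternation at the top level of the lookahead, as in the question.
ungrouped = re.compile(r"(?si)\bFIND ME\b(?![^<>]*</a>|</span>)" + tail)
# Alternation confined to the tag name, as in the reply.
grouped = re.compile(r"(?si)\bFIND ME\b(?![^<>]*</(?:a|span)>)" + tail)

# Hypothetical sample: two spans and one plain paragraph.
html = (
    "<body><span>FIND ME now</span> "
    "<span>FIND ME</span> "
    "<p>FIND ME</p></body>"
)

ungrouped_count = sum(1 for _ in ungrouped.finditer(html))
grouped_count = sum(1 for _ in grouped.finditer(html))
print(ungrouped_count, grouped_count)
```

With this sample the ungrouped pattern still matches "FIND ME now" inside the first span, because other text sits between the phrase and </span>; the grouped pattern matches only the occurrence in the paragraph.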

