
Problem filtering HTML with a negative lookahead construct

Last post 09-28-2009, 12:07 PM by killahbeez. 3 replies.
  •  09-28-2009, 3:40 AM 56464

    Problem filtering HTML with a negative lookahead construct

    Hi,

    I have hundreds of links on a website that were hand-coded in HTML as hyperlinks. Our designer has told me that they shouldn't be links, so rather than removing the <A> elements line by line, I'd like to use the Regular Expression search and replace function in my development IDE to do it for me.

    Here is some example HTML code I have...

    <dl>
            <dt>Residential Mortgages</dt>
            <dd><a href="/Residential-Mortgages/Purchases.php">Purchases</a></dd>
            <dd><a href="/Residential-Mortgages/Remortgages.php">Remortgages</a></dd>
            <dd><a href="/Residential-Mortgages/First-Time-Buyers.php">First Time Buyers</a></dd>
            <dd><a href="/Residential-Mortgages/Holiday-Homes.php">Holiday Homes</a></dd>
            <dd><a href="/Residential-Mortgages/Second-Homes.php">Second Homes</a></dd>
            <dd><a href="/Residential-Mortgages/Pied-a-Terre.php">Pied-à-Terre</a></dd>
            <dd><a href="/Residential-Mortgages/Shared-Equity.php">Shared Equity</a></dd>
            <dd><a href="/Residential-Mortgages/Government-Schemes.php">Government Schemes</a></dd>
            <dd><a href="/Residential-Mortgages/Secured-Loans.php">Secured Loans</a></dd>
            <dd><a href="/Residential-Mortgages/Large-Loans.php">Large Loans</a></dd>
        </dl>
    <p><a href="http://www.google.com/" rel="external">my link</a></p>

    The problem is that I want to remove all the <A> elements from the page except those that have the "rel" attribute set.

    So, for example, the RegEx should match all the <A> elements inside the <DL> element but not the <A> element inside the <P> element at the bottom.

    I've been using http://gskinner.com/RegExr/ to help me test my RegEx and so far I have the following...

    <a href=".+"(?! rel="external")>.+</a> 

    ...but it doesn't work as I expected (i.e. it still matches every <A> element on the page), so I need a little guidance please.
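
    A minimal Python sketch of why this pattern still matches the rel="external" link. It assumes Python's re module treats these constructs the same way as the RegExr engine, and the test string is the last line of the example HTML above:

```python
import re

# Pattern from the question, unchanged.
pattern = re.compile(r'<a href=".+"(?! rel="external")>.+</a>')

line = '<p><a href="http://www.google.com/" rel="external">my link</a></p>'
match = pattern.search(line)

# The greedy .+ also consumes quote characters, so it can run through
# `" rel="external` and leave only the final quote for the literal `"`.
# The lookahead is then tested just before `>`, where ` rel="external"`
# no longer lies ahead, so the negative lookahead succeeds.
print(match.group(0))
```

    In other words, the lookahead is checked at only one position, and the engine is free to pick a position where it passes.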

    Many thanks!

    M.

  •  09-28-2009, 10:15 AM 56472 in reply to 56464

    Re: Problem filtering HTML with a negative lookahead construct

    (?is)<a href=(?:(?!rel="external"|>).)+>.+?</a>
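
    A hedged sketch of how this pattern might be used for the search and replace, written in Python rather than an IDE. The capture group around the link text, the `\1` replacement and the name ANCHOR are assumptions added for illustration; they are not part of the pattern above:

```python
import re

# Pattern from the reply, with one assumed change: a capture group around
# the link text so the replacement can keep the text and drop the tags.
ANCHOR = re.compile(r'(?is)<a href=(?:(?!rel="external"|>).)+>(.+?)</a>')

html = (
    '<dd><a href="/Residential-Mortgages/Purchases.php">Purchases</a></dd>\n'
    '<dd><a href="/Residential-Mortgages/Large-Loans.php">Large Loans</a></dd>\n'
    '<p><a href="http://www.google.com/" rel="external">my link</a></p>\n'
)

# Replace each matched <a>...</a> with just its inner text.
stripped = ANCHOR.sub(r'\1', html)
print(stripped)
```

    The two <DD> links lose their <A> wrapper, while the rel="external" link in the <P> element is left as it was.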

    http://portal-vreme.ro
  •  09-28-2009, 10:57 AM 56474 in reply to 56472

    Re: Problem filtering HTML with a negative lookahead construct

    Hi Killahbeez,

    Thanks, that helped a great deal!

    But could you explain it a little, please? I'm fairly new to RegExp, and an explanation might help others who are in the same boat.

    First of all, I'm not sure what the (?is) is for.

    I can see that for the href attribute you are looking to match any character (using the period .) one or more times (using the +) and then wrapping the whole thing in a non-capturing group.

    But there are two confusing elements which are:

    1.) Why does removing the non-capturing group break the RegExp? Do lookaround constructs always need to be contained inside a non-capturing group?

    and

    2.) Why does removing the OR statement | from inside the negative lookahead break the RegExp, given that you already include a closing > character for the <A> element after the non-capturing group?

    If you could help me understand why this RegExp works I would be grateful.

    Many thanks.

    M. 

  •  09-28-2009, 12:07 PM 56476 in reply to 56474

    Re: Problem filtering HTML with a negative lookahead construct

    (?is) sets the case-insensitive (i) and dotall (s) modifiers; with dotall, . also matches a newline
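
    A small Python sketch of what the two modifiers do, assuming Python's re module, where the inline (?is) is equivalent to passing re.IGNORECASE | re.DOTALL. The test string is made up for illustration:

```python
import re

# Hypothetical sample: upper-case tag names and a newline inside the link text.
text = '<A HREF="x.php">line one\nline two</A>'

inline = re.search(r'(?is)<a href=.+?</a>', text)
flagged = re.search(r'<a href=.+?</a>', text, re.IGNORECASE | re.DOTALL)
plain = re.search(r'<a href=.+?</a>', text)

print(inline.group(0) == flagged.group(0))  # same match either way
print(plain)  # no match without the modifiers
```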

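    1 - a non-capturing group (?:...) only groups; since no backreference is needed, there is nothing to capture. Lookarounds do not have to be inside a group in general. Here the group is what lets the + repeat the lookahead together with the ".", so the check runs before every character. Without the group, the lookahead is tested only once, right after href=, and the ".+" then runs unchecked

    2 - if you don't use the alternative, the repeated "." is not stopped by the tag's closing > and will match more than one tag, running on up to the next rel="external". You can see how it works if you delete the "|>" part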
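
    Both points can be checked with a short Python sketch (assuming Python's re module; the two-line sample is cut down from the HTML in the first post, and the variable names are hypothetical):

```python
import re

html = (
    '<dd><a href="/Residential-Mortgages/Purchases.php">Purchases</a></dd>\n'
    '<p><a href="http://www.google.com/" rel="external">my link</a></p>'
)

# The full pattern: the lookahead is re-tested before every character.
full = re.findall(r'(?is)<a href=(?:(?!rel="external"|>).)+>.+?</a>', html)

# Point 2: without "|>" the repeated dot may cross ">" and span several tags.
no_alt = re.findall(r'(?is)<a href=(?:(?!rel="external").)+>.+?</a>', html)

# Point 1: without the group the lookahead is tested once, then ".+" runs free.
no_group = re.findall(r'(?is)<a href=(?!rel="external"|>).+>.+?</a>', html)

print(full)
print('rel="external"' in no_alt[0])
print('rel="external"' in no_group[0])
```

    With the full pattern only the first link is matched. Each of the two modified patterns produces a single match that starts at the first <A> and runs through to the end of the rel="external" link.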

