
Trying to get url and anchor text from html content, output {anchor}: {url}

Last post 12-01-2008, 6:06 PM by Aussie Susan. 2 replies.
  •  12-01-2008, 5:19 PM 49036

    Trying to get url and anchor text from html content, output {anchor}: {url}

    The heading says it all. I've got files which I want to convert to text, but without losing the URLs. So I am trying to construct a regular expression to convert links from the normal

    <a href="http://www.someurl.com/path/to/file.php" target="whatever" title="maybe">Here's the anchor text</a>

     to

    Here's the anchor text: http://www.someurl.com/path/to/file.php

     I've tried lots of ways. This is the latest, and I'm stuck. Several hours trying to do this!

    <?php
    $text = 'yabba<a href="http://www.gardenzone.info/" target="_blank">The Gardenzone</a>dabba doo';
    $text = eregi_replace('<a href="(.+)"(.+)>(.+)</a>',"$3: $1",$text);
    echo "$text<br />$text2";
    ?>
     


    Some kind of gothy geek
  •  12-01-2008, 5:43 PM 49038 in reply to 49036

    Re: Trying to get url and anchor text from html content, output {anchor}: {url}

    I don't do PHP, but based on what you've done so far, try:

    $text = eregi_replace('<a href="(.+?)".*?>(.+?)</a>',"$2: $1",$text);

  •  12-01-2008, 6:06 PM 49039 in reply to 49036

    Re: Trying to get url and anchor text from html content, output {anchor}: {url}

    You don't actually tell us what is going wrong, so I'm guessing that the "href" attribute value is coming back as something like

     http://www.someurl.com/path/to/file.php" target=

    The basic problem is that the "+" quantifier is greedy and tries to match as many characters as possible. When applied to the '.' operator, this means that it will first grab all characters to the end of the string. It will then start to backtrack (because the rest of the pattern will repeatedly fail) until it gets to the "=" after "target".
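    To make the backtracking visible, here is a minimal sketch that runs the original pattern through preg_match and prints each capture group. It assumes the preg functions are available and reuses the sample string from the first post; the variable name $m is only for illustration.

```php
<?php
// Sample input copied from the original post.
$text = 'yabba<a href="http://www.gardenzone.info/" target="_blank">The Gardenzone</a>dabba doo';

// Original pattern with all three groups greedy, wrapped in '#' delimiters.
if (preg_match('#<a href="(.+)"(.+)>(.+)</a>#i', $text, $m)) {
    // Group 1 runs past the closing quote of href and stops at 'target='.
    echo "group 1: {$m[1]}\n";
    // Group 2 is left with whatever sits between that point and '>'.
    echo "group 2: {$m[2]}\n";
    // Group 3 still ends up holding the anchor text.
    echo "group 3: {$m[3]}\n";
}
```

    With this input, group 1 comes back as the URL followed by `" target=`, which is the symptom described above.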

    Now there are several solutions, but none (that I know of) that can be used with the 'ereg' family of functions, because they use the POSIX regex syntax and POSIX does not have the 'extra features' needed to solve this problem.

    If you can switch to using the 'preg' family of functions (which are supposedly faster and certainly do provide for more "modern" regex expressions), then making the first match group's quantifier non-greedy, as in:

    <a href="(.+?)"(.+)>(.+)</a>

    will provide the matches you indicate.

    Note that the preg functions require the pattern to be delimited (I've used the '#' character) and the 'i' modifier to be specified to get back the 'case insensitive' matching. So the function call would be something like (can't test this, but the idea is OK):

    $text = preg_replace('#<a href="(.+?)"(.+)>(.+)</a>#i', '$3: $1', $text);
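    As a rough, self-contained check of the call above, the sketch below applies it to the sample string from the first post. The second half is an assumption on my part rather than anything from the thread: it swaps the '.+' groups for negated character classes, which should cope with several links on one line and with links that have no attributes after href. The names $pattern, $two, $result and $result2, and the example.com-style URLs, are hypothetical.

```php
<?php
// Sample input copied from the original post.
$text = 'yabba<a href="http://www.gardenzone.info/" target="_blank">The Gardenzone</a>dabba doo';

// The pattern from the post: lazy first group, '#' delimiters, 'i' modifier.
$result = preg_replace('#<a href="(.+?)"(.+)>(.+)</a>#i', '$3: $1', $text);
echo $result, "\n";

// Assumed variant: negated classes stop at the next quote, '>' or '<',
// so no group can run past the end of its own link.
$pattern = '#<a\s+href="([^"]+)"[^>]*>([^<]+)</a>#i';

// Hypothetical input with two links, one of them without extra attributes.
$two = 'See <a href="http://a.example/">A</a> and <a href="http://b.example/" title="t">B</a>.';
$result2 = preg_replace($pattern, '$2: $1', $two);
echo $result2, "\n";
```

    The first call should print the anchor text, a colon and the URL in place of the link. The variant assumes href is the first attribute and that the anchor text contains no nested tags, so treat it as a starting point rather than a general HTML parser.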

    Susan

     
