
Parsing HTML: matches not inside a link <a .....></a>

Last post 10-07-2009, 4:18 PM by DublinFrench. 15 replies.
  •  10-04-2009, 5:20 AM 56563

    Parsing HTML: matches not inside a link <a .....></a>

    I need a regex that matches a given string anywhere in the text, except inside an HTML link tag.

    I'm trying to insert links into HTML text, so I need to skip matches that are already inside a tag like <a ......>  </a>.

    I'm working with PHP.

    For example:

        - I'm searching for:

                     the

        - my string is:

                     this the text i want and <a href="test.com" onclick='.......'>this the link</a> and here another example with the

       - I need to match the two occurrences of "the" outside the link (in bold), and not the "the" inside the link.

    I hope it is clear. I use the DOM, but sometimes a node contains a link and my str_replace gives me problems. I'm starting to have a hard time with it...

     The goal is to match certain words and link them to a dictionary / glossary.

     Any help appreciated, thanks :)

     

     


  •  10-04-2009, 7:39 AM 56565 in reply to 56563

    Re: Parsing HTML: matches not inside a link <a .....></a>

    (?is)\bthe\b(?!(?:(?!</?a\b).)*</a>)

    But the correct way to do this is to use the DOM.
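
    As an illustration of how the pattern above behaves, here is a minimal sketch that applies it to the sample string from the first post. The "[the]" marker and the "~" delimiter are choices made for this example only, not part of the original answer.

```php
<?php
// Subject string taken from the first post of the thread.
$subject = 'this the text i want and <a href="test.com" onclick=\'.......\'>this the link</a>'
         . ' and here another example with the';

// Match "the" as a whole word unless a closing </a> follows with no other
// <a ...> or </a> tag in between (that is, unless we are inside a link).
$pattern = '~(?is)\bthe\b(?!(?:(?!</?a\b).)*</a>)~';

// "[the]" is an arbitrary marker chosen for this illustration.
$result = preg_replace($pattern, '[the]', $subject);

echo $result, "\n";
```

    Only the first and last "the" are replaced; the one between <a ...> and </a> is left alone. Note that the pattern only looks for link tags, so a keyword inside an attribute of some other tag would still match.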


    http://portal-vreme.ro
  •  10-04-2009, 8:12 AM 56567 in reply to 56565

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Hi killahbeez

    Thanks for that. I'm going to test it right now.

    You're right about the DOM, except that I have problems when some links are inside the node content.

    In short:

    public static $listforbiddenTags = array("a", "img", "javascript", "css", "script");

    $doc->loadHTML($content);
    $root = $doc->firstChild;
    $doc->normalizeDocument();
    self::$dom=$doc;
    parseNodes($doc, $keyWord, $replacement);


    function parseNodes($node, $keyWord, $replacement)
    {
        $node->normalize();
        if ($node->hasChildNodes())
        {
            $subNodes = $node->childNodes;
            foreach ($subNodes as $subNode)
            {
                parseNodes($subNode, $keyWord, $replacement);                
            }
        }
        else
        {
            if (!in_array($node->parentNode->nodeName, self::$listforbiddenTags) && $node->nodeType == XML_TEXT_NODE
                && strlen(trim($node->wholeText))>=1 && used($node->nodeValue))
            {
                $newelement = self::$dom->createTextNode(str_replace($keyWord, $replacement, $node->nodeValue));
                $node->parentNode->replaceChild($newelement, $node);
                $node->normalize();
            }
        }
    }

    The problem: I can still have some HTML tags inside my $node->nodeValue even when $node->hasChildNodes() returns false. I even have cases where $node->nodeValue starts with a <span  ....

     That's why I was considering a regex: the DOM library is giving me a hard time here.

  •  10-04-2009, 9:22 AM 56569 in reply to 56567

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Did you try iterating through all the childNodes and checking their nodeType in the cases where hasChildNodes does not return the right value? (What does $node->childNodes->length return there?)


    http://portal-vreme.ro
  •  10-04-2009, 11:30 AM 56570 in reply to 56569

    Re: Parsing HTML: matches not inside a link <a .....></a>

    ==> The problem occurs here:

    print("\n<br />node->childNodes->length: ".$node->childNodes->length);
    print("\n<br />node->parentNode->length: ".$node->parentNode->length);
    print("\n<br />hasChildNodes: ".$node->hasChildNodes());
    print("\n<br />length: ".$node->childNodes->length);
    print("\n<br />node->value: ".$node->nodeValue);

    => It displays:

     <br />node->childNodes->length:

    <br />node->parentNode->length:

    <br />hasChildNodes: 
    <br />length:
    <br />node->value:

    <span onmouseout="......." ><a href="http://test.com" target="_blank" title="test" class="class2" >it's a lin!!!!</a></span>
    Obviously there's something wrong.
    Cheers
     
     

     

  •  10-04-2009, 12:05 PM 56571 in reply to 56570

    Re: Parsing HTML: matches not inside a link <a .....></a>

    I don't know if this matches your case, but I have the following, and it works as I expect.

    <?php
    Header("Content-type:text/plain");


    $content=<<<didi
        <div><span onmouseout="......." ><a  href="http://test.com" target="_blank" title="test"  class="class2" >it's a lin!!!!</a></span></div>
    didi;


    $doc = new DOMDocument();
    $doc->loadHTML($content);

    $divs=$doc->getElementsByTagName("div");

    foreach($divs as $item){
        echo "hasChildNodes: ".$item->hasChildNodes()."\n";
        echo "nodeValue: ".$item->nodeValue."\n";
        echo "nodeName: ".$item->nodeName."\n";
        echo "Number of childNodes: ".$item->childNodes->length."\n";
    }

    /*OUTPUT

    hasChildNodes: 1
    nodeValue: it's a lin!!!!
    nodeName: div
    Number of childNodes: 1

    */

    ?>


    http://portal-vreme.ro
  •  10-04-2009, 12:57 PM 56572 in reply to 56571

    Re: Parsing HTML: matches not inside a link <a .....></a>

    one of my problems:

    $content = "
    <ul>
        <li>title: possibility to <a href=\"www.test.com\">define the title</a> from the link regards where this title is located</li>
    </ul>
    ";
    Class Parsing
    {
        public static $listforbiddenTags = array("a", "img", "javascript", "css", "script");
        public static $dom = null;
       
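        // Declared static because it is called as Parsing::parseNodes() and self::parseNodes()
        public static function parseNodes($node, $keyWord, $replacement)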
        {
            if ($node->hasChildNodes())
            {
                $subNodes = $node->childNodes;
                foreach ($subNodes as $subNode)
                {
                    self::parseNodes($subNode, $keyWord, $replacement);                
                }
            }
            else
            {
                if (isset($node->nodeValue))
                {
                   
                    $newelement = self::$dom->createTextNode(str_replace($keyWord, $replacement, $node->nodeValue));
                    $node->parentNode->replaceChild($newelement, $node);
                }
            }
        }

    }

    $keyWord="title";
    $replacement="GOT YOU";

    $doc = new DomDocument('1.0', 'UTF-8');

    $doc->loadHTML($content);
    $root = $doc->firstChild;
    //$doc->normalizeDocument();

    Parsing::$dom=$doc;

    Parsing::parseNodes($doc, $keyWord, $replacement);

    $content=$doc->saveHTML();

    print(($content));

     

    /* OUTPUT

    <br />node->nodeType => 3 <-> 
    <br />node->value: title: possibility to
    <br />node->nodeType => 3 <->
    <br />node->value:
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><ul>
    <li>GOT YOU: possibility to <a href="www.test.com">define the title</a> from the link regards where this title is located</li>
    </ul></body></html>
    */

    What's wrong:
       1 - The last "title" was not replaced.
       2 - I now have new HTML tags around the content, but that's not very important.
    Anyway, I'm no longer sure I can use the DOM for this.
    What is your opinion?
     Cheers
     
  •  10-05-2009, 12:11 AM 56576 in reply to 56572

    Re: Parsing HTML: matches not inside a link <a .....></a>

    I see the problem now.

    The solution is to check, when a node has no childNodes (and has nodeType 3, #text), that its parentNode is not A, and only then do the replacement
    (or you can use XPath to check whether any of the ancestors is A. With something like <a href="http://regexadvice.com/forums/AddPost.aspx?PostID=56572#">test<b>bold text</b></a>, the parentNode of "bold text" is not A, but it has an ancestor that is A).

    The DOM is safer, but if the regexp I gave works for you, you can use it.
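
    The XPath idea can be sketched as follows. This is only one possible way to do it: replaceOutsideLinks() and the "glossary-root" wrapper id are hypothetical names introduced for the example. It selects every text node without an <a> ancestor and copies the result into an array before changing anything. Replacing nodes while iterating over a live childNodes list can cut the loop short, which is a likely reason the last "title" was skipped in the code above.

```php
<?php
// Hypothetical helper (not from the thread): replace $keyword in every text
// node that has no <a> ancestor, using DOMXPath instead of manual recursion.
function replaceOutsideLinks(string $html, string $keyword, string $replacement): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so its nodes can be found again after parsing;
    // the "glossary-root" id is an arbitrary name chosen for this sketch.
    $doc->loadHTML('<div id="glossary-root">' . $html . '</div>');

    $xpath = new DOMXPath($doc);
    // Text nodes anywhere in the document that are not inside a link.
    $query = '//text()[not(ancestor::a)]';

    // Copy the result into a plain array before modifying anything.
    foreach (iterator_to_array($xpath->query($query)) as $textNode) {
        $textNode->nodeValue = str_replace($keyword, $replacement, $textNode->nodeValue);
    }

    // Serialize only the children of the wrapper, not the
    // html/body tags that loadHTML() adds around the fragment.
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$content = '<ul><li>title: possibility to <a href="www.test.com">define the title</a>'
         . ' from the link regards where this title is located</li></ul>';

echo replaceOutsideLinks($content, 'title', 'GOT YOU'), "\n";
```

    Because the replacement is written into a text node, it is inserted as plain text. Turning the keyword into a real link would need new element nodes (for example via createElement) rather than a string replacement. The sketch also assumes ASCII input; other encodings need extra care with loadHTML().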


    http://portal-vreme.ro
  •  10-05-2009, 4:07 AM 56577 in reply to 56576

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Hi, thanks for your answer and for your time :)

    I had several problems with the DOM, and this is only one of them...

     I'm now trying with regular expressions, using a trick: I add some marker characters next to the occurrences I don't want to replace (1 - already in a link, 2 - inside a tag definition), then I simply replace all the others, and finally I delete the marker characters I added. The algorithm can work. Actually, it does, except that I still have 3 problems:

     

     => This one puts ##### before the keyword if it is inside a link:

    $content = preg_replace( '|(<a[^>]+>)(.*)('.$keyword.')(.*)(</a[^>]*>)|Ui', '$1$2#####$3$4$5', $content);

        1 - Sometimes the keyword appears more than once inside the link, and this regexp replaces it only once. There should be a way to target all the occurrences, but I'm still learning and I can't find it. Furthermore, with the U flag the regexp matches only the first occurrence, and without it, only the last one. Weird.

        2 - Sometimes I have 2 links inside $content, and the regexp matches blablsablabla </a><a href='....'> dsrg dsfg gdf    => in that case I'm trying to add the condition "does not contain </a>" to $2 and $4, but I'm still stuck on the syntax.

        3 - Last problem: sometimes my keyword contains characters like * or . etc. I'm still looking for a way to escape special characters in my strings in a generic way.
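
    One way to approach the three points above is sketched below, as an illustration rather than a definitive fix. markKeywordInLinks() is a hypothetical helper name. preg_replace_callback() handles each link separately, the lazy .*? stops at the first </a> so two links are never merged into one match, the inner preg_replace() marks every occurrence inside the link text, and preg_quote() escapes characters such as * and . in the keyword. The sketch assumes links are not nested and that the marker contains no "$" or "\" characters.

```php
<?php
// Hypothetical helper: put $marker before every occurrence of $keyword
// found in the text between <a ...> and </a>.
function markKeywordInLinks(string $html, string $keyword, string $marker = '#####'): string
{
    // Escape regex metacharacters in the keyword, including the "~" delimiter.
    $quoted = preg_quote($keyword, '~');

    return preg_replace_callback(
        // Group 1: opening tag, group 2: link text (lazy), group 3: closing tag.
        '~(<a\b[^>]*>)(.*?)(</a>)~is',
        function (array $m) use ($quoted, $marker) {
            // Mark all occurrences inside this one link; $0 is the matched keyword.
            $inner = preg_replace('~' . $quoted . '~i', $marker . '$0', $m[2]);
            return $m[1] . $inner . $m[3];
        },
        $html
    );
}

echo markKeywordInLinks('<a href="x">the cat and the dog</a> the <a href="y">the end</a>', 'the'), "\n";
echo markKeywordInLinks('<a href="#">c++ rocks</a>', 'c++'), "\n";
```

    The keyword between the two links is left unmarked, and the attributes in the opening tag are never touched because only group 2 is rewritten.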

     This is where I am today, and at this stage I have better results than with the DOM...

     Cheers :)

  •  10-05-2009, 6:11 AM 56579 in reply to 56577

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Why don't you use my regexp? Is it not working properly? If that is the case, give me some input content and the expected output.

    http://portal-vreme.ro
  •  10-05-2009, 7:46 AM 56582 in reply to 56579

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Because I have trouble adapting it. I understand that it should solve the problem of keywords inside links, but I think I'm missing something. I'm still learning regexps and I can't always do what I want.

    $content = preg_replace('|(?is)\b'.$keyword.'\b(?!(?:(?!</?a\b).)*</a>)|i', '$1$2&&&$3$4$5', $content);

  •  10-05-2009, 12:55 PM 56589 in reply to 56582

    Re: Parsing HTML: matches not inside a link <a .....></a>

    $content=<<<didi
        <div>
            <span onmouseout="......." >
                .*keyword.*
                <a  href="http://test.com" target="_blank" title="test"  class="class2" >it's a .*keyword.*!!!!</a>
                .*keyword.* blabla
            </span>
            .*keyword.*
        </div>

    didi;

    $keyword=".*keyword.*";
    $replaceWith="DIDI";


        $content=preg_replace('{(?isx) # inline modifiers: case-insensitive, dotall, extended (whitespace is ignored and comments are allowed)
                                '.preg_quote($keyword).' # preg_quote escapes characters that have a special meaning in a regex, here * and .
                                (?!
                                    (?:(?!</?a\b).)*</a> # a closing anchor tag ahead with no other anchor tag in between: we are inside a link
                                ) # negative lookahead: do not match inside an anchor tag
                            }',$replaceWith,$content);

    echo $content;

    /*
    INPUT
        <div>
            <span onmouseout="......." >
                .*keyword.*
                <a  href="http://test.com" target="_blank" title="test"  class="class2" >it's a .*keyword.*!!!!</a>
                .*keyword.* blabla
            </span>
            .*keyword.*
        </div>

    OUTPUT
        <div>
            <span onmouseout="......." >
                DIDI
                <a  href="http://test.com" target="_blank" title="test"  class="class2" >it's a .*keyword.*!!!!</a>
                DIDI blabla
            </span>
            DIDI
        </div>
    */

    http://portal-vreme.ro
  •  10-06-2009, 4:03 AM 56599 in reply to 56589

    Re: Parsing HTML: matches not inside a link <a .....></a>

    Hi

    I tried to adapt it and to capture $2, but I get a blank page. It's weird, because I understand your regexp; it's clearly much better than mine and it should work.

    $content=preg_replace('{(?isx)('.preg_quote($keyword).')(?!(?:(?!</?a\b).)*</a>)}',$replacementStart.'$2'.$replacementEnd, $content, $nbRep, $count);

     Anyway, thanks for your time. Even if I can't make it work, your posts were really instructive for me.

     Cheers

     

    DF

  •  10-06-2009, 1:52 PM 56606 in reply to 56599

    Re: Parsing HTML: matches not inside a link <a .....></a>

    I definitely have trouble adapting this one. I'm trying to capture the keyword so that I can surround it instead of replacing it, but it returns nothing.
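
    For surrounding the keyword instead of replacing it, a minimal sketch could look like this. wrapOutsideLinks() and its parameter names are hypothetical. Two things in the earlier attempt are worth checking against it: the pattern there has only one capturing group, so $2 is always empty, and the fourth argument of preg_replace() is the replacement limit while the fifth receives the count. Using a callback and $m[0] (the whole match) avoids both the group numbering and any "$" handling in the surrounding strings.

```php
<?php
// Hypothetical helper: wrap every occurrence of $keyword that is not
// inside a link with $before and $after.
function wrapOutsideLinks(string $html, string $keyword, string $before, string $after): string
{
    // Same negative lookahead as in the thread, with the keyword escaped
    // for the "~" delimiter used in this sketch.
    $pattern = '~' . preg_quote($keyword, '~') . '(?!(?:(?!</?a\b).)*</a>)~is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($before, $after) {
            // $m[0] is the matched keyword with its original letter case.
            return $before . $m[0] . $after;
        },
        $html
    );
}

echo wrapOutsideLinks('the <a href="x">the link</a> the', 'the', '<b>', '</b>'), "\n";
```

    The sketch deliberately leaves out the x modifier: in extended mode whitespace in the pattern is ignored, and preg_quote() does not escape spaces, so a keyword made of several words would stop matching. It also matches the keyword inside longer words; adding \b around it is possible when the keyword starts and ends with a word character.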
  •  10-06-2009, 2:04 PM 56608 in reply to 56606

    Re: Parsing HTML: matches not inside a link <a .....></a>

    check private message

    http://portal-vreme.ro