Various ways to control single-line vs. multi-line matching for Perl HTML parsing?

  •  07-11-2007, 5:29 AM

    I am parsing HTML with Perl to pick up URLs and their link titles from web pages. (Actually, I am using a Perl script not written by me that takes URL addresses and regexps as input and produces links from the wanted site.) I often have trouble controlling whether a pattern matches within a single line or across multiple lines. Sometimes I want the match to stop at the end of a line, and sometimes I need it to cover more than one line. So my questions are:

    1) How do I specify that a match must NOT continue onto the next line? Is there a general way to specify this, or do I have to spell it out in more detail (see 2)?
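
    For reference, a minimal sketch of the standard Perl behaviour behind this question. It assumes the pattern reaches the regex engine unchanged, which may not hold for the script I am using; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Default: "." does not match a newline, so .* stops at the end of the line.
my ($default) = $text =~ /first(.*)/;

# With /s, "." also matches newlines, so a greedy .* runs to the end of the string.
my ($dotall) = $text =~ /first(.*)/s;

# A negated class states the limit explicitly and behaves the same with or without /s.
my ($explicit) = $text =~ /first([^\n]*)/s;

printf "default:  [%s]\n", $default;
printf "dotall:   [%s]\n", $dotall;
printf "explicit: [%s]\n", $explicit;
```

    Only the /s version captures text from the second line.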


    2) How can I control how far a match extends once it is allowed to go beyond one line? It seems that as soon as I let a match span several lines, it goes as far as it possibly can.

    E.g. in the 4 examples below, I want to avoid matching the first and second but still match the third and fourth.

    With the regex below I have managed to filter out the second (the one with the img tag) and to match the third and fourth, but it also matches the first (which I don't want).

    <a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+.html)">([^<].*?)</a>

    To avoid matching the first, I tried

    <a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+.html)">([^<].*?)</a><

    but it still matched at the first one, with a wrong match for the URL title: the title part ended up containing a lot of HTML code from several consecutive lines.

    --- EXAMPLE 1: 
    <div id="blabla"><div id="blabla2" style="margin-top: 0">
    <a href="/iwantthis/folder/0,16368,2483_2482095,00.html">Title of url and title I DON'T want</a>
    </div></div>
    --- EXAMPLE 2: 
    <div class="blabla4"><!--mainBlaBla-->	
    <a href="/iwantthis/folder/0,16368,1765_2489790,00.html"><img src="http://images.somesite.com/iDONTwantthis.jpg"></a>
    </div
    ---  EXAMPLE 3: 
    <h1><a href="/iwantthis/folder/0,16368,1765_2489790,00.html">I WANTthis url and title</a></h1>
    ---  EXAMPLE 4: 
    <h4 class="blabla3"><a href="/iwantthis/folder/0,16368,1765_2489630,00.html">I WANTthis url and title too</a></h4> 
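
    To make the four examples easy to test, here is a small sketch. It is an assumption on my part, not how the script actually works: it swaps `[^<].*?` for `[^<]+`, escapes the dot in `.html`, and applies /s because the multi-line result above suggests that "." matches newlines in that script. The names `$html`, `$re` and `@links` are made up.

```perl
use strict;
use warnings;

# Sample input copied from the four examples (indentation dropped).
my $html = <<'HTML';
<div id="blabla"><div id="blabla2" style="margin-top: 0">
<a href="/iwantthis/folder/0,16368,2483_2482095,00.html">Title of url and title I DON'T want</a>
</div></div>
<div class="blabla4"><!--mainBlaBla-->
<a href="/iwantthis/folder/0,16368,1765_2489790,00.html"><img src="http://images.somesite.com/iDONTwantthis.jpg"></a>
</div
<h1><a href="/iwantthis/folder/0,16368,1765_2489790,00.html">I WANTthis url and title</a></h1>
<h4 class="blabla3"><a href="/iwantthis/folder/0,16368,1765_2489630,00.html">I WANTthis url and title too</a></h4>
HTML

# [^<]+ cannot cross a tag, so the title can never run on to a later </a>,
# even under /s. The trailing "<" requires a tag directly after </a>, which
# rules out example 1, where a newline follows </a>. Example 2 fails because
# its title would have to start with "<".
my $re = qr{<a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+\.html)">([^<]+)</a><}s;

my @links;
while ($html =~ /$re/g) {
    push @links, [$1, $2];    # [url, title]
}

print "$_->[0] => $_->[1]\n" for @links;
```

    On this sample it prints only the links from examples 3 and 4.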

    3) Is there a tool that shows the HTML code with special characters such as newlines, carriage returns and spaces made visible? It is sometimes hard to tell whether the text contains \n, \r or other whitespace (what \s would match). Failing that, is there any other tool that might not show every control character but would still make it easier to construct regexps for HTML parsing with Perl?
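
    The kind of output I have in mind could be sketched in Perl roughly like this. `show_invisibles` is a made-up helper name, not an existing tool or module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text in which carriage returns,
# newlines, tabs and other control characters are shown as visible escapes.
sub show_invisibles {
    my ($text) = @_;
    my %name = ("\r" => '\r', "\n" => '\n', "\t" => '\t');
    $text =~ s/([\r\n\t])/$name{$1}/g;
    # Any remaining control character becomes \xNN.
    $text =~ s/([\x00-\x1f\x7f])/sprintf('\\x%02x', ord $1)/ge;
    return $text;
}

print show_invisibles("<h1>\r\n\t<a href=\"x\">title</a>\r\n"), "\n";
```

    This prints the sample on one line, with every line break and tab spelled out as \r, \n or \t.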

     

    borgeh 
