I am parsing HTML with Perl to pick up URLs and their link titles from web pages (actually I am using a Perl script, not written by me, that takes URLs and regexps as input and produces links from the wanted site). I often have problems controlling whether a pattern matches within a single line or across multiple lines. Sometimes I want the match to stop at the end of the line, and sometimes I need it to cover more than one line. So my questions are:
1) How do I specify that a match must NOT continue onto the next line? Is there a general way to specify this, or do I have to handle it case by case (see 2)?
2) How can I control how far the match extends when it spans more than one line? Sometimes it appears that once I allow a match to span several lines, it goes as far as possible.
E.g. in the 4 examples below I want to avoid matching the first and second, but at the same time match the third and fourth.
I have managed to filter out the second (the one with the img tag) and to match the third and fourth with this regex, but it also matches the first (which I don't want):
<a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+.html)">([^<].*?)</a>
In order to avoid matching the first I tried
<a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+.html)">([^<].*?)</a><
but it still produced a match starting at the first one, with the wrong link title: the title group ended up capturing a lot of HTML code from several consecutive lines.
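A minimal sketch of what seems to be going on, for illustration only. The two-line sample string and the variable names are made up, and it is an assumption (not confirmed) that the script applies the /s modifier. In Perl, `.` does not match a newline unless /s is given, so a lazy `.*?` can only run across lines when /s is on:

```perl
use strict;
use warnings;

# Hypothetical two-line sample, not taken from the real page.
my $html = qq{<a href="/x.html">Title</a>\n</div><a href="/y.html">Other</a><};

# Without /s, "." stops at the newline, so starting from the first link
# the lazy group can never reach the "</a><" on the second line.
my ($no_s) = $html =~ m{<a href="/x\.html">(.*?)</a><};

# With /s, "." also matches "\n", so the group keeps growing across
# lines until the first "</a><" it can find.
my ($with_s) = $html =~ m{<a href="/x\.html">(.*?)</a><}s;

print defined $no_s ? "no /s: $no_s\n" : "no /s: no match\n";
print "with /s: $with_s\n";
```

With /s the captured title here is everything from "Title" up to and including "Other", which is the same kind of runaway title described above.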
--- EXAMPLE 1:
<div id="blabla"><div id="blabla2" style="margin-top: 0">
<a href="/iwantthis/folder/0,16368,2483_2482095,00.html">Title of url and title I DON'T want</a>
</div></div>
--- EXAMPLE 2:
<div class="blabla4">
<a href="/iwantthis/folder/0,16368,1765_2489790,00.html"><img src="http://images.somesite.com/iDONTwantthis.jpg"></a>
</div>
--- EXAMPLE 3:
<h1><a href="/iwantthis/folder/0,16368,1765_2489790,00.html">I WANTthis url and title</a></h1>
--- EXAMPLE 4:
<h4 class="blabla3"><a href="/iwantthis/folder/0,16368,1765_2489630,00.html">I WANTthis url and title too</a></h4>
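One possible way to get only examples 3 and 4, sketched under the assumption that the wanted links are always followed directly by another tag (as in the examples) and that a title never contains a tag. The only changes to the second regex above are `[^<]+` in place of `[^<].*?` and an escaped dot before `html`; the variable names are hypothetical:

```perl
use strict;
use warnings;

# The four examples from above, joined into one string.
my $html = <<'HTML';
<div id="blabla"><div id="blabla2" style="margin-top: 0">
<a href="/iwantthis/folder/0,16368,2483_2482095,00.html">Title of url and title I DON'T want</a>
</div></div>
<div class="blabla4">
<a href="/iwantthis/folder/0,16368,1765_2489790,00.html"><img src="http://images.somesite.com/iDONTwantthis.jpg"></a>
</div>
<h1><a href="/iwantthis/folder/0,16368,1765_2489790,00.html">I WANTthis url and title</a></h1>
<h4 class="blabla3"><a href="/iwantthis/folder/0,16368,1765_2489630,00.html">I WANTthis url and title too</a></h4>
HTML

# [^<]+ can never run past the next "<", so the title cannot swallow tags,
# with or without /s. It also needs at least one non-"<" character, which
# rejects example 2 (the img tag). The trailing "<" rejects example 1,
# where "</a>" is followed by a newline rather than a tag.
my $re = qr{<a href="(/iwantthis/folder/\d+,\d+,\d+_\d+,\d+\.html)">([^<]+)</a><};

my @links;
while ($html =~ /$re/g) {
    push @links, [$1, $2];    # [url, title]
}
print "$_->[0] => $_->[1]\n" for @links;
```

Because the title is bounded by a negated character class instead of `.*?`, how far it extends no longer depends on whether the script enables multi-line matching.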
3) Is there a tool that makes it easier to see the HTML code together with special characters like newlines, carriage returns, spaces etc.? Sometimes it is difficult to know whether there are \n, \r or other whitespace characters in the text. Or is there any other tool that might not show all control characters but would make it easier to construct regexps for HTML parsing with Perl?
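As a rough illustration of the kind of view meant in 3), here is a small sketch that prints a string with its invisible characters spelled out. The helper name `show_invisibles` and the sample string are hypothetical, not part of the script in question:

```perl
use strict;
use warnings;

# Hypothetical helper: make whitespace and control characters visible.
sub show_invisibles {
    my ($text) = @_;
    # Common whitespace escapes are written out as two visible characters.
    my %name = ("\n" => '\n', "\r" => '\r', "\t" => '\t');
    $text =~ s/([\n\r\t])/$name{$1}/g;
    # Any remaining control character is shown as a hex escape.
    $text =~ s/([\x00-\x1f\x7f])/sprintf('\\x{%02x}', ord $1)/ge;
    return $text;
}

# Made-up sample containing CR, LF, a tab and a form feed.
my $sample = "<h1>\r\n\t<a href=\"/x.html\">T</a>\f</h1>";
print show_invisibles($sample), "\n";
```

Data::Dumper with `$Data::Dumper::Useqq = 1` gives a similar escaped view of a string, and on Unix-like systems `od -c` shows each character of a file including control characters.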
borgeh