find all URLs but not those within <code></code> tags

Last post 07-22-2008, 8:47 AM by ddrudik. 12 replies.
  •  07-19-2008, 11:27 AM 44332

    find all URLs but not those within <code></code> tags

    hello guys,
    i am trying to make all URLs clickable, i.e. wrap them in an <a href> tag. the problem is that i want to exclude those which are enclosed in a code tag:

    i am using this regex to match the URLs

    ((((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&amp;%_\./-~-]*)?)

    which works very well ..

    now i need to extend it in a way that it does not match url within <code></code> blocks.

     should not match the url in:

    <code>asdkasdk lkjl www.url.com dajsldk lkj </code>

     

    thanks a lot!!

  •  07-20-2008, 8:18 PM 44388 in reply to 44332

    Re: find all URLs but not those within <code></code> tags

    Please read the posting guidelines in the sticky note at the top of this forum. Before we can give you a reasonable answer, we need to know the regex engine you are using so we understand its capabilities and limitations.

    However a quick answer that may be of use is to put

    <code((?!</code>).)*<code>

    as the first alternative in your pattern. This means that the regex engine will (unless it is POSIX based) use the first alternative that matches and so effectively hide the rest of the pattern when the "code" tag is found.

    Susan 

  •  07-21-2008, 8:57 AM 44404 in reply to 44388

    Re: find all URLs but not those within <code></code> tags

    susan thanks for your answer. sorry i havent read the sticky topic .. however i tried to search for my problem before posting.

    i am using it within vbscript. hope that answers your regex engine question.

    i am not sure where to put the pattern you have mentioned? i mean how do i combine it with my url-matching-pattern?

    thx a lot..

  •  07-21-2008, 2:41 PM 44419 in reply to 44404

    Re: find all URLs but not those within <code></code> tags

    VBscript is more limited than other platforms for such operations, but here's how I would do that:

    <%
    ' Step 1: find every <code>...</code> block.
    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    teststring = "<code>asdkasdk lkjl www.url.com dajsldk lkj </code> www.url2.com test <code>asdkasdk2 lkjl www.url3.com dajsldk lkj </code> www.url4.com"
    regEx.Pattern = "<code>[\S\s]*?</code>"
    Set CodeMatches = regEx.Execute(teststring)
    ' Step 2: with Global off, each Replace swaps the first remaining block
    ' for a numbered placeholder: Chr(1) & index & Chr(2).
    regEx.Global = False
    For x = 0 to CodeMatches.Count-1
      teststring=regEx.Replace(teststring,Chr(1) & x & Chr(2))
    Next
    ' Step 3: link every URL in what is left.
    regEx.Pattern ="(?:(?:(?:(?:file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(?:www\.))+(?:(?:[a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|(?:[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(?:/[a-zA-Z0-9\&amp;%_\./-~-]*)?)"
    regEx.Global = True
    teststring=regEx.Replace(teststring,"<a href=""redirect.asp?url=$&"">$&</a>")
    ' Step 4: put the original code blocks back in place of the placeholders.
    ' Caveat: sequences such as $1 or $& inside a code block would be
    ' interpreted by Replace as replacement tokens.
    regEx.Global = False
    For x = 0 to CodeMatches.Count-1
      regEx.Pattern = Chr(1) & x & Chr(2)
      codeblock=CodeMatches(x).Value
      teststring=regEx.Replace(teststring,codeblock)
    Next
    response.write "<pre>" & teststring
    %>

    Note that redirect.asp is a script that doesn't exist; substitute whatever redirect script you use in the resulting HREFs. Also note that \&amp; inside your character class matches the individual characters &, a, m, p and ; rather than the entity &amp;, and your URL matching pattern seems to have other issues as well. The example above is just an ASP framework for how this might work and isn't necessarily a recommendation of a proper URL matching pattern.


  •  07-21-2008, 7:38 PM 44425 in reply to 44404

    Re: find all URLs but not those within <code></code> tags

    gabru:

    i am not sure where to put the pattern you have mentioned? i mean how do i combine it with my url-matching-pattern?

    Just to answer this part of your question, I was suggesting that you put it in front of the appropriate part of your pattern.

    The way the (non-POSIX) regex engine processes alternatives is that it tries to match them in the order they are specified and will 'match' the first one that succeeds (it may try later ones if it backtracks to the alternation again). Therefore a pattern such as:

    \w+?(a|b|c)

    will match word characters up to and including the first "a", "b" or "c" that it finds (after at least one leading character). If the next character it tests is an 'a' then it will immediately consider the alternation successfully matched and carry on without checking the 'b' or 'c' possibilities.
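The ordering rule described above can be checked with a short sketch. It is written in Python rather than VBScript, on the assumption that Python's `re` module is also a backtracking (non-POSIX) engine and treats alternation the same way; the sample strings are made up for illustration.

```python
import re

# Lazy \w+? grows one character at a time until (a|b|c) can match.
m = re.search(r"\w+?(a|b|c)", "xyzbca")
first_hit = m.group(0)   # text up to and including the first a/b/c
which_alt = m.group(1)   # the alternative that succeeded

# Alternatives are tried left to right and the first success wins,
# even when a later alternative would give a longer match.
short_first = re.match(r"a|ab", "ab").group(0)
long_first = re.match(r"ab|a", "ab").group(0)

print(first_hit, which_alt, short_first, long_first)
```

A POSIX leftmost-longest engine would return "ab" for both orderings of the last pattern, which is why the order of alternatives only matters on non-POSIX engines.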

    In your case, you want to match something unless it is within "code" tags. Therefore you can use this alternation behaviour as follows (pseudo-code):

    (match code tagged text | match something else) 

    You have already written the "match something else" part, and I was suggesting that you add in the "match code tagged text" part with my pattern fragment.

    By the way, alternation is rather different to many of the other regex operators, which generally apply to the single item immediately to their left. In other words, the '?' in the pattern 'asd2?' applies ONLY to the '2' and not the 'asd'. However, alternation applies to everything on either side of it to the end of the enclosing match group (remembering that the whole pattern is effectively enclosed in a matching group #0). Therefore the pattern 'abc|def' will match the text "abc" OR "def". That is why I put parentheses around the pseudo-code above: to make sure that the alternation is limited to the part I'm interested in.

    (BTW, I specified non-POSIX regex engines above because POSIX regex engines will always test all possible paths, as they return the longest of all valid matches; non-POSIX engines normally return the first valid match they find.)
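A minimal sketch of the scoping difference, again in Python as a stand-in for the VBScript engine. The helper `matches` and the pattern `ab(c|d)ef` are illustrative additions, not part of the thread.

```python
import re

def matches(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# '?' binds only to the item immediately to its left (the '2').
quantifier_scope = [matches(r"asd2?", s) for s in ("asd", "asd2", "")]

# '|' splits the whole pattern: "abc" OR "def".
whole_pattern = [matches(r"abc|def", s) for s in ("abc", "def", "abdef")]

# Parentheses limit the alternation to "c" OR "d" between "ab" and "ef".
grouped = [matches(r"ab(c|d)ef", s) for s in ("abcef", "abdef", "abc")]

print(quantifier_scope, whole_pattern, grouped)
```

In each list the first two inputs match and the third does not, showing where each operator's reach ends.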

    Make sense?

    Susan 

  •  07-21-2008, 10:26 PM 44428 in reply to 44425

    Re: find all URLs but not those within <code></code> tags

    susan thanks for your detailed description but i guess i am not an expert in regex .. therefore i dont 100% understand everything, nor do i know what my engine is called

    the only thing i know is that i am using vbscript and your version with alternation didnt work... am i doing something wrong or shouldnt it work in vbscript anyway?

     i think ddrudik version will work but before doing this i want to be sure there is no "simpler" solution ....

    thanks ddrudik for your time you took to write the code .. it helped me more to understand the problem  better. Is there really no other solution which can be solved with only one pattern?

  •  07-21-2008, 10:32 PM 44429 in reply to 44428

    Re: find all URLs but not those within <code></code> tags

    Can you:

    1) provide us with the snippet of your code
    2) tell us what "didn't work"? Is there an error message, or just unexpected results, or what?
    3) post a small sample of the real data you are working with

    My suggestion uses lookahead which is supported in the VBScript regex v5.5 - make sure you are using this and not an older version.

    Susan

     

  •  07-21-2008, 11:36 PM 44430 in reply to 44428

    Re: find all URLs but not those within <code></code> tags

    It's a much simpler solution if you switch to a platform that allows for performing functions on matches, such as PHP and preg_replace_callback.
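For readers who can use a callback-capable platform, the idea looks roughly like this. The sketch uses Python's `re.sub` with a function argument as an analogue of `preg_replace_callback`; the URL pattern is a simplified, hypothetical stand-in for the one in the thread, and `link_or_keep` is an illustrative name.

```python
import re

# Hypothetical, simplified URL pattern used only for this sketch.
URL = r"(?:(?:https?|ftp)://|www\.)[a-z0-9._-]+\.[a-z]{2,6}(?:/[^\s<]*)?"
# Code blocks come first so they are consumed before any URL inside them.
PATTERN = re.compile(r"(?P<code><code>.*?</code>)|(?P<url>" + URL + r")",
                     re.IGNORECASE | re.DOTALL)

def link_or_keep(match):
    """Return code blocks untouched and wrap URLs in a link."""
    if match.group("code") is not None:
        return match.group("code")
    url = match.group("url")
    return '<a href="{0}">{0}</a>'.format(url)

text = ("<code>asdkasdk lkjl www.url.com dajsldk lkj </code> www.url2.com test "
        "<code>asdkasdk2 lkjl www.url3.com dajsldk lkj </code> www.url4.com")
linked = PATTERN.sub(link_or_keep, text)
print(linked)
```

Because the callback decides per match what to emit, no placeholder step is needed: code blocks are matched only so that they can be returned unchanged.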
  •  07-22-2008, 2:25 AM 44433 in reply to 44429

    Re: find all URLs but not those within <code></code> tags

    susan i am testing the regex here:

    www.grafix.at/ajaxed/console/  => Regex tab

    pattern:

     (<code((?!</code>).)*<code>|(((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&amp;%_\./-~-]*)?)

    searchstring:

    asdasd www.grafix.at <code>adasdad www.anurl.com </code>

    replacing with:

    X$&X

    results in matching all URLS .. also those within code block...

    thanks for help

     

  •  07-22-2008, 2:27 AM 44434 in reply to 44430

    Re: find all URLs but not those within <code></code> tags

    ddrudik:
    It's a much simpler solution if you switch to a platform that allows for performing functions on matches, such as PHP and preg_replace_callback.

    yeah i will switch platform only because of this ;)

  •  07-22-2008, 3:04 AM 44435 in reply to 44433

    Re: find all URLs but not those within <code></code> tags

    My mistake - my original suggestion should read:

    <code((?!</code>).)*</code>

    I forgot the '/' inside the ending tag!!!! Sorry. Without that, the first alternative looks for a second opening tag, never matches in your sample, and so the engine carries on to the URL alternative.
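The effect of the missing '/' can be seen by running both versions against the search string from the earlier post. This is a Python sketch, assuming the lookahead behaves the same as in the VBScript 5.5 engine.

```python
import re

sample = "asdasd www.grafix.at <code>adasdad www.anurl.com </code>"

# As first posted: the closing tag is missing its '/', so the pattern
# needs a second opening <code> tag and finds nothing in this sample.
broken = re.search(r"<code((?!</code>).)*<code>", sample)

# Corrected: the whole <code>...</code> block is consumed as one match.
fixed = re.search(r"<code((?!</code>).)*</code>", sample)

print(broken)
print(fixed.group(0))
```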

    However, this is the first time I've realised that you are wanting this for a substitution, and so there will be other issues you will need to address. Basically, the "<code>.....</code>" text WILL form part of the overall match and this will present you with 2 problems. The first is that you are adding text around all of the matched text (by using the '$&' replacement text) so you will probably need to add in some additional parentheses and use that match group explicitly in your replacement. The other problem is that, because you are matching the "<code>...</code>" text, you will need to make sure that you replace this but without making any other changes.

    My suggestion is to change the pattern to:

    ((<code((?!</code>).)*</code>)|((((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(/[a-zA-Z0-9\&amp;%_\./-~-]*)?))

    and simply do a match. Check match group #4 and, if it is not null, perform the replacement yourself. (If you are getting multiple matches then you may want to start from the last match and work backward, so that a substitution never shifts the start/end offsets of the matches you have not yet processed.)
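A rough sketch of this match-then-replace approach, in Python rather than VBScript. The pattern is the combined one from the post, unchanged; the function name `wrap_urls` and the "X" markers (borrowed from the earlier X$&X test) are illustrative assumptions.

```python
import re

# The combined pattern from the post above, split across lines only.
PATTERN = re.compile(
    r"((<code((?!</code>).)*</code>)|"
    r"((((file|gopher|news|nntp|telnet|http|ftp|https|ftps|sftp)://)|(www\.))+"
    r"(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))"
    r"(/[a-zA-Z0-9\&amp;%_\./-~-]*)?))",
    re.IGNORECASE)

def wrap_urls(text, before="X", after="X"):
    """Wrap only the matches whose group 4 (the URL alternative) is set."""
    found = list(PATTERN.finditer(text))
    # Walk from the last match to the first so earlier offsets stay valid.
    for m in reversed(found):
        if m.group(4) is not None:
            start, end = m.span(4)
            text = text[:start] + before + m.group(4) + after + text[end:]
    return text

result = wrap_urls("asdasd www.grafix.at <code>adasdad www.anurl.com </code>")
print(result)
```

Matches where group 4 is empty are the code blocks; skipping them leaves that text exactly as it was.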

     Susan

  •  07-22-2008, 3:17 AM 44436 in reply to 44435

    Re: find all URLs but not those within <code></code> tags

    thanks for the quick reply .. will try that! sounds good...
  •  07-22-2008, 8:47 AM 44448 in reply to 44436

    Re: find all URLs but not those within <code></code> tags

    gabru, when you have a working solution please post your vbscript code here as I am curious to see what worked.