
Stripping <IMG> tag containing an invalid URL

Last post 07-29-2008, 2:47 AM by jimbo1. 3 replies.
  •  07-07-2008, 1:42 AM 43805

    Stripping <IMG> tag containing an invalid URL

    Hello again,

    Thanks for all the great help so far. I've got another slightly more involved requirement now, once again for the following HTML <IMG> tag:

    <img src="http://www.client_url.com.au/new/images/client_logo.gif" height="250" width="570" alt="Client Logo"/>

    If the URL pointed to by the SRC attribute is invalid, I need to strip out the entire IMG tag from the text string I'm validating.

    So far, I've built the following RegExp to identify an IMG tag containing a valid SRC URL:

    <img\b[^>]*(?:src=(['"])(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^>]*\/?>

    NOTE: For the strings I'm dealing with, it's safe to assume all HTML Tag Attributes will be enclosed by double quote characters.

    I'm having a nightmare trying to re-write this RegExp to recognise a tag containing an invalid URL. I've been trying variations of (?!<expn_to_identify_a_valid_url>), but after thirty minutes appear to be making no progress.

    Can anybody offer any assistance?

    Thanks in advance for any further help.

    James

  •  07-07-2008, 8:02 PM 43850 in reply to 43805

    Re: Stripping <IMG> tag containing an invalid URL

    James,

    When you strip out all of the duplications and optional parts that won't be used, your pattern reduces to:

    <img.*?(?:src=(['"])https?://(\S+)\1)[^>]*/?>

    (I've taken the liberty of actually checking for the matching single- or double-quote which you seem to be setting up but not using.)

    The reason is that, once you use the '(\S+)' sub-pattern, the regex engine matches every non-space character it can find and only backtracks as far as it needs to in order to locate the '/?>' part at the end. The other optional sub-patterns that follow the '(\S+)' part are effectively ignored, as they only look for non-space characters that the '(\S+)' has already matched.
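
    To see the effect described above, here is a minimal sketch in Python's re module (an assumption; the thread never names the regex engine in use). It runs James's original pattern and the reduced pattern against a sample tag with the alt attribute quoted normally, and inspects the capture groups. The variable names are illustrative only.

```python
import re

# James's original pattern from the first post, unchanged.
original = re.compile(
    r"""<img\b[^>]*(?:src=(['"])(http|https):\/\/(\w+:{0,1}\w*@)?(\S+)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?)[^>]*\/?>"""
)

# The reduced pattern, which also checks the closing quote via \1.
reduced = re.compile(r"""<img.*?(?:src=(['"])https?://(\S+)\1)[^>]*/?>""")

tag = ('<img src="http://www.client_url.com.au/new/images/client_logo.gif" '
       'height="250" width="570" alt="Client Logo"/>')

m_original = original.search(tag)
m_reduced = reduced.search(tag)

# (\S+) is group 4 in the original pattern; it swallows the closing quote too.
print(repr(m_original.group(4)))
# The optional port and path groups (5 and 6) never take part in the match.
print(m_original.group(5), m_original.group(6))
# In the reduced pattern, \1 forces (\S+) to give the closing quote back.
print(repr(m_reduced.group(2)))
```

    In this run the optional port and path groups come back as None, because '(\S+)' has already consumed everything up to the first space, including the closing quote.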

    The problem of determining if a URL is valid or not is non-trivial and, as the gurus of this forum (ddrudik, mash et al) have often said, it cannot really be done by a regex. What you consider 'valid' may not be what I do, and something that both of us agree is 'valid' may not represent something that actually exists in the (cyber)world.

    Also, are you wanting to match an invalid URL so that you can replace it with a null string, match it so that the 'then' clause of an 'if...then' statement will be used to remove the whole tag, or are you wanting to fail a match if the URL is invalid (presumably because you want to use the 'else' clause of an 'if...then...else' statement)?

    There are two methods that I tend to use to exclude something from a match:

    1) if the unwanted string is known to exist at a certain point in the string, then use a negative lookahead at that point:

    src=(?!")\S+

    will match the literal string 'src=' only when it is followed by a non-space character other than a double-quote; 'src=' followed directly by a double-quote will not match.

    2) if the character string is somewhere ahead of you but you don't know where, then something like:

    src="(((?!abc).)+)"

    will match whatever follows the 'src="' until the next double quote UNLESS the string "abc" is found. This technique moves forward 1 character at a time to perform the search and match process.
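
    Both techniques can be tried out with a short sketch, again assuming Python's re module. The sample strings and variable names are made up for illustration.

```python
import re

# Technique 1: negative lookahead at a known position, right after 'src='.
known_position = re.compile(r'src=(?!")\S+')
t1_unquoted = known_position.search('<img src=logo.gif alt=x>')
t1_quoted = known_position.search('<img src="logo.gif" alt=x>')

# Technique 2: re-check the lookahead before every character consumed,
# so 'abc' anywhere in the captured value makes the match fail.
anywhere = re.compile(r'src="(((?!abc).)+)"')
t2_clean = anywhere.search('<img src="images/logo.gif">')
t2_abc = anywhere.search('<img src="images/abc/logo.gif">')

print(t1_unquoted.group(0))   # the unquoted value matches
print(t1_quoted)              # the quoted value is rejected
print(t2_clean.group(1))      # value without 'abc' is captured
print(t2_abc)                 # value containing 'abc' is rejected
```

    One thing to watch with technique #2: the '.' also matches double quotes, so on a tag with several attributes the capture can run past the end of the src value. Replacing '.' with '[^"]' keeps the match inside the attribute.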

    If you can decide exactly what you mean by a 'valid URL' then technique #1 is probably the one you want. If you decide that an 'invalid URL' is one that contains some pattern somewhere inside it, then technique #2 is the one you might consider.

    Susan 

  •  07-29-2008, 12:47 AM 44677 in reply to 43850

    Re: Stripping <IMG> tag containing an invalid URL

    Hi Susan,

    Thanks for your detailed response of a few weeks ago. I actually got pulled off this problem and put back onto what I specialise in: Oracle database development, primarily PL/SQL.

    The RegExp work has now returned to bite me on the bum. In terms of this issue, I can see where you're coming from, i.e. the problems associated with identifying an invalid URL.

    For the purpose of what I'm dealing with, I'm trying to identify any <IMG> tags containing a SRC URL that is not preceded by the string "http://". For example, the following would be an Image tag containing what our Client has deemed to be an 'invalid' URL:

    <img src="www.client_url.com.au/new/images/client_logo.gif" height="250" width="570" alt="Client Logo"/>

    I guess I'll need to use a negative lookahead to identify this?

    I'll see what I'm able to put together this arvo.

    Cheers.

    James
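
    A negative lookahead placed right after the opening quote of the src value is one way to express the requirement above. The following is only a sketch, assuming Python's re module and case-insensitive matching (neither is stated in the thread); the pattern name and the sample text are hypothetical.

```python
import re

# Match an <img> tag whose src value does NOT start with 'http://'.
# [^>]*? keeps the search inside a single tag; the lookahead is tested
# immediately after the opening quote of the src attribute.
no_http = re.compile(r'''<img\b[^>]*?\bsrc=(["'])(?!http://)[^>]*>''',
                     re.IGNORECASE)

invalid = ('<img src="www.client_url.com.au/new/images/client_logo.gif" '
           'height="250" width="570" alt="Client Logo"/>')
valid = ('<img src="http://www.client_url.com.au/new/images/client_logo.gif" '
         'height="250" width="570" alt="Client Logo"/>')

text = 'before ' + invalid + ' middle ' + valid + ' after'

# Strip only the tags the pattern flags as invalid.
cleaned = no_http.sub('', text)
print(cleaned)
```

    Only the tag without the 'http://' prefix is removed; the valid tag is left in place.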
  •  07-29-2008, 2:47 AM 44680 in reply to 44677

    Re: Stripping <IMG> tag containing an invalid URL

    I've put together the following RegExp to identify an invalid SRC url in an IMG tag.

    The criteria for determining whether the URL is invalid are:

    1. The URL does not begin with the string http://www or https://www.
    2. The Image File referenced by the URL does not have one of the following extensions: jpeg|jpg|gif|png

    It is safe to assume that the URL will be enclosed between either double or single quotes.

    My 'valid' test <IMG> tag is:

     <img src="http://www.client_url.com.au/new/images/client_logo.jpg" height="250" width="570" alt="Client Logo"/>

    Here are examples of 'invalid' URLs (according to our Client's requirements):

    1. <img src="www.client_url.com.au/new/images/client_logo.jpg" height="250" width="570" alt="Client Logo"/>
    2. <img src="http://www.client_url.com.au/new/images/client_logo.bmp" height="250" width="570" alt="Client Logo"/>

    The RegEx I have come up with (with the help of Sue's suggestion above) to identify those 'invalid' URLs, is:

    <img.*?(?:src=(['"])(?:(?!https?://www\.(\S+))|(?!https?://(\S+)\.(gif|jpg|jpeg|png))))[^>]*/?>

    I'm sure it is probably possible to simplify this RegEx, so I guess that is my next question. Can anybody come up with something more elegant?
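
    One possible simplification is to describe a valid URL once and negate it with a single lookahead, rather than alternating two negative lookaheads. The sketch below assumes Python's re module and case-insensitive matching, and it additionally requires the file extension to sit directly before the closing quote, which is stricter than the pattern above. The function name strip_invalid_images is hypothetical.

```python
import re

# A tag is 'invalid' unless its src value starts with http://www. or
# https://www. AND ends in one of the allowed extensions right before
# the closing quote (\1 refers back to the opening quote).
invalid_img = re.compile(
    r'''<img\b[^>]*?\bsrc=(["'])(?!https?://www\.[^"'>]*\.(?:jpe?g|gif|png)\1)[^>]*>''',
    re.IGNORECASE,
)

def strip_invalid_images(text):
    """Remove every <img> tag whose src fails the two criteria."""
    return invalid_img.sub('', text)

attrs = ' height="250" width="570" alt="Client Logo"/>'
valid = '<img src="http://www.client_url.com.au/new/images/client_logo.jpg"' + attrs
no_scheme = '<img src="www.client_url.com.au/new/images/client_logo.jpg"' + attrs
bad_ext = '<img src="http://www.client_url.com.au/new/images/client_logo.bmp"' + attrs

print(strip_invalid_images('A' + valid + 'B' + no_scheme + 'C' + bad_ext + 'D'))
```

    With the three test tags from this post, only the valid tag survives. Whether the extension should be anchored to the end of the URL (which rules out query strings, for example) depends on the Client's requirements.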
