
Parse IMG tags

Last post 06-20-2008, 4:53 AM by killahbeez. 17 replies.
  •  06-18-2008, 9:25 AM 43275

    Parse IMG tags

    Hi guys,

    I'm not too familiar with regular expressions and I've been having a hard time building an expression that will do what I want. Basically, what I need is a regular expression which will parse out all IMG tags found in a string. In addition, I need easy access (match groups) for the following attributes: src, height, width, alt.

    Here are some examples of img tags that this regular expression should properly parse:

    <img src="/images/image1.jpg" alt="Test Image" height="2" width="1" />
    <img src='/images/image1.jpg' alt='Test Image' height='2' width='1' />
    <img src='/images/image1.jpg' alt="Test Image" height='2' width="1" />

    <img height="2" width="1" src="/images/image1.jpg" alt="Test Image" />
    <img height='2' width='1' src="/images/image1.jpg" alt="Test Image" />
    <img height='2' width="1" src='/images/image1.jpg' alt="Test Image" />

    It's important that this expression parse the attributes no matter which order they are in, and whether their values are quoted with ' or ".

    Would anybody be so kind as to help me put this expression together properly?

    Thank you so much for your time and efforts.

  •  06-18-2008, 10:07 AM 43277 in reply to 43275

    Re: Parse IMG tags

    Can you guarantee that all attributes will be present in all img tags you plan to parse? Also, can you guarantee that the img tags will not have other attributes and will always use ' or " to surround values?
  •  06-18-2008, 10:32 AM 43278 in reply to 43277

    Re: Parse IMG tags

    Try this one

    <img[^>]+?((src=["'][^"']+["'])\s*|(alt=["'][^"']+["'])\s*|(height=["'][^"']+["'])\s*|(width=["'][^"']+["'])\s*)*[^>]+?>

  •  06-18-2008, 11:45 AM 43279 in reply to 43277

    Re: Parse IMG tags

    ddrudik:
    Can you guarantee that all attributes will be present in all img tags you plan to parse? Also, can you guarantee that the img tags will not have other attributes and will always use ' or " to surround values?

    No, I cannot. This regex will be used to parse images from many web pages.

  •  06-18-2008, 11:46 AM 43280 in reply to 43278

    Re: Parse IMG tags

    killahbeez:

    Try this one

    <img[^>]+?((src=["'][^"']+["'])\s*|(alt=["'][^"']+["'])\s*|(height=["'][^"']+["'])\s*|(width=["'][^"']+["'])\s*)*[^>]+?>

    This gets me all of the attributes I'm looking for. How do I then programmatically reference each attribute? Since the attributes are not in any defined order, how can I access the values of the src, width, height, and alt attributes?

  •  06-18-2008, 12:06 PM 43284 in reply to 43280

    Re: Parse IMG tags

    That depends on which regex engine you use.

    In PCRE, the named-capture syntax is (?P<name>...), so the regex becomes:

    <img[^>]+?((?:src=["'](?P<src>[^"']+)["'])\s*|(?:alt=["'](?P<alt>[^"']+)["'])\s*|(?:height=["'](?P<height>[^"']+)["'])\s*|(?:width=["'](?P<width>[^"']+)["'])\s*)*[^>]+?>

    See what PHP's preg_match_all function returns here:

    http://www.myregextester.com/?r=205
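As a rough illustration of reading the named groups back out: Python's `re` module happens to accept the same `(?P<name>...)` syntax, so the sketch below uses it in place of PHP. The helper name `parse_img_tags` and the sample input are assumptions made for this example, not part of the thread. Because the attribute group repeats, each name keeps only the last value it captured within a tag.

```python
import re

# The pattern from the post above, split across lines for readability only.
IMG_RE = re.compile(
    r"""<img[^>]+?("""
    r"""(?:src=["'](?P<src>[^"']+)["'])\s*|"""
    r"""(?:alt=["'](?P<alt>[^"']+)["'])\s*|"""
    r"""(?:height=["'](?P<height>[^"']+)["'])\s*|"""
    r"""(?:width=["'](?P<width>[^"']+)["'])\s*"""
    r""")*[^>]+?>"""
)


def parse_img_tags(html):
    """Return one dict per matched tag, keyed by the named groups.

    A group that did not take part in the match maps to None.
    """
    return [match.groupdict() for match in IMG_RE.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<img src="/images/image1.jpg" alt="Test Image" height="2" width="1" />\n'
        "<img height='2' width=\"1\" src='/images/image1.jpg' alt=\"Test Image\" />"
    )
    for attrs in parse_img_tags(sample):
        print(attrs["src"], attrs["alt"], attrs["height"], attrs["width"])
```

Looking attributes up by name means the order in which they appear in the tag no longer matters to the calling code.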

     

  •  06-18-2008, 5:01 PM 43286 in reply to 43284

    Re: Parse IMG tags

    killahbeez:

    That depends on which regex engine you use.

    In PCRE, the named-capture syntax is (?P<name>...), so the regex becomes:

    <img[^>]+?((?:src=["'](?P<src>[^"']+)["'])\s*|(?:alt=["'](?P<alt>[^"']+)["'])\s*|(?:height=["'](?P<height>[^"']+)["'])\s*|(?:width=["'](?P<width>[^"']+)["'])\s*)*[^>]+?>

    See what PHP's preg_match_all function returns here:

    http://www.myregextester.com/?r=205

     

     

    I am using .NET. Would you happen to know how I can achieve this with .NET?

  •  06-18-2008, 5:06 PM 43287 in reply to 43286

    Re: Parse IMG tags

    I think I figured this out myself:

    <img[^>]+?((src=["'](?<src>[^"']+)["'])\s*|(alt=["'](?<alt>[^"']+)["'])\s*|(height=["'](?<height>[^"']+)["'])\s*|(width=["'](?<width>[^"']+)["'])\s*)*[^>]+?>

  •  06-18-2008, 5:40 PM 43288 in reply to 43278

    Re: Parse IMG tags

    killahbeez:

    Try this one

    <img[^>]+?((src=["'][^"']+["'])\s*|(alt=["'][^"']+["'])\s*|(height=["'][^"']+["'])\s*|(width=["'][^"']+["'])\s*)*[^>]+?>

     

    I've found some img tags that this expression catches but doesn't properly group. Can you help me figure out why please? Here's one:

    <img id="sw3-0" class="swatchQ" src="/media/images/products/msw/16840BB4154_BLK_msw_v1_m56577569831612853.jpg" width="14" height="14" alt="BLK" title="BLK" border="0">
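A likely explanation, sketched in Python (an assumption of convenience; the same reasoning should apply to other backtracking engines): the leading `[^>]+?` only has to consume the single space after `<img`, the repeated attribute group then fails immediately at `id=` and settles for zero repetitions, and the trailing `[^>]+?` swallows everything up to `>`. The tag matches, but no group ever participates. The helper `img_attrs` below is a hypothetical workaround that looks up each attribute separately inside the tag.

```python
import re

NAMES = ("src", "alt", "height", "width")

# The thread's pattern with named groups, rebuilt from NAMES.
PATTERN = re.compile(
    r"<img[^>]+?("
    + "|".join(r"""(?:%s=["'](?P<%s>[^"']+)["'])\s*""" % (n, n) for n in NAMES)
    + r")*[^>]+?>"
)

TAG = (
    '<img id="sw3-0" class="swatchQ" '
    'src="/media/images/products/msw/16840BB4154_BLK_msw_v1_m56577569831612853.jpg" '
    'width="14" height="14" alt="BLK" title="BLK" border="0">'
)

match = PATTERN.match(TAG)
print(match is not None)   # True: the whole tag is matched...
print(match.groupdict())   # ...but every named group is None


def img_attrs(tag, names=NAMES):
    """Look up each wanted attribute on its own, so order and neighbours don't matter.

    Sketch only: handles quoted values and could be fooled by attribute-like
    text inside another attribute's value.
    """
    result = {}
    for name in names:
        found = re.search(r"\s" + name + r"\s*=\s*([\"'])(.*?)\1", tag, re.IGNORECASE)
        result[name] = found.group(2) if found else None
    return result


print(img_attrs(TAG))
```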

  •  06-18-2008, 7:54 PM 43294 in reply to 43288

    Re: Parse IMG tags

    My effort would be along different lines:

    <img(\s+((\w+)(?:=("[^"]*"|[^\s>]+))?))*\s*/?>

    with the 'ignore case' option set as a safety measure in case someone used "<IMG" at the start.

    This pattern will accept any number of attributes within the tag and an optional "/>" or ">" tag ending. Each attribute can optionally be followed by an '=' and a value, which can be a double-quoted string (without escaped or doubled double quotes) or any string of non-whitespace characters.

    Each capture of  match group #2 (you said you were using .NET so captures are available to you) will contain each attribute and its value as a whole. The match group #3 captures will contain just the attribute names in the order they appear and the captures of match group #4 will contain the values (or NULL if the attribute has no value associated with it).

    You will need to scan the captures of match group #3 for the attributes you are interested in and then look at the corresponding capture in match group #4 for any possible value.
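For engines that keep only the last capture of a repeated group (Python's `re` is one), the same idea can be approximated in two passes: match the whole tag first, then scan its attribute section one attribute at a time. The sketch below is an adaptation rather than the pattern above. It also accepts single-quoted and empty values, and the names `ATTR`, `IMG_TAG` and `parse_imgs` are made up for this example.

```python
import re

# One attribute: a name, optionally followed by "=" and a double-quoted,
# single-quoted, or unquoted value.
ATTR = r"""([\w-]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?"""
ATTR_RE = re.compile(ATTR)

# A whole tag: "<img", any number of attributes, then ">" or "/>".
IMG_TAG = re.compile(r"<img(?P<attrs>(?:\s+" + ATTR + r")*)\s*/?>", re.IGNORECASE)


def parse_imgs(html):
    """Return a list of {attribute name: value} dicts, one per img tag.

    Attributes without a value map to None; surrounding quotes are removed.
    """
    images = []
    for tag in IMG_TAG.finditer(html):
        attrs = {}
        for attr in ATTR_RE.finditer(tag.group("attrs")):
            name, value = attr.group(1).lower(), attr.group(2)
            if value and len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            attrs[name] = value
        images.append(attrs)
    return images


if __name__ == "__main__":
    html = (
        "<img height='2' width=\"1\" src='/images/image1.jpg' alt=\"Test Image\" />"
        "<IMG ismap src=a.png width=14>"
    )
    for image in parse_imgs(html):
        print(image.get("src"), image.get("width"), image.get("height"), image.get("alt"))
```

The caller then picks out src, width, height and alt by name, which plays the role of scanning the name captures and reading the matching value capture.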

    Susan 

  •  06-19-2008, 2:00 AM 43303 in reply to 43294

    Re: Parse IMG tags

    This regexp will cover all

    <img\s+(?:(?:src=["'](?P<src>[^"']+)["'])\s*|(?:alt=["'](?P<alt>[^"']+)["'])\s*|(?:height=["'](?P<height>[^"']+)["'])\s*|(?:width=["'](?P<width>[^"']+)["'])\s*|\w+=["'][^"']+["']\s*)*/?> 

  •  06-19-2008, 10:24 AM 43322 in reply to 43303

    Re: Parse IMG tags

    It's easy to make incorrect assumptions about how the HTML you will encounter is coded. For example, this is perfectly valid HTML that the given patterns don't allow for:

    <img id="sw3-0" class="swatchQ" src="/media/images/products/msw/16840BB4154_BLK_msw_v1_m56577569831612853.jpg" width=14 height=14 alt=BLK title=BLK border=0>
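A quick way to check this, sketched in Python (chosen only because its `re` module accepts the `(?P<name>...)` syntax used earlier in the thread): run the most recent pattern against the quoted and the unquoted version of the same tag. The variable names are assumptions made for the example.

```python
import re

# The "cover all" pattern posted earlier in the thread, split for readability.
STRICT = re.compile(
    r"""<img\s+(?:"""
    r"""(?:src=["'](?P<src>[^"']+)["'])\s*|"""
    r"""(?:alt=["'](?P<alt>[^"']+)["'])\s*|"""
    r"""(?:height=["'](?P<height>[^"']+)["'])\s*|"""
    r"""(?:width=["'](?P<width>[^"']+)["'])\s*|"""
    r"""\w+=["'][^"']+["']\s*"""
    r""")*/?>"""
)

SRC = "/media/images/products/msw/16840BB4154_BLK_msw_v1_m56577569831612853.jpg"

QUOTED = (
    '<img id="sw3-0" class="swatchQ" src="' + SRC + '" '
    'width="14" height="14" alt="BLK" title="BLK" border="0">'
)
UNQUOTED = (
    '<img id="sw3-0" class="swatchQ" src="' + SRC + '" '
    "width=14 height=14 alt=BLK title=BLK border=0>"
)

quoted_match = STRICT.search(QUOTED)
print(quoted_match.group("width"), quoted_match.group("alt"))  # 14 BLK
print(STRICT.search(UNQUOTED))                                 # None
```

Every alternative in the pattern requires an opening quote after `=`, so the first unquoted value stops the repetition and the closing `/?>` can no longer be reached.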


  •  06-19-2008, 1:45 PM 43330 in reply to 43322

    Re: Parse IMG tags

    The validity of that HTML depends on its DTD. Under a strict doctype it is not valid: attribute values MUST be enclosed in quotes.
  •  06-19-2008, 3:05 PM 43336 in reply to 43279

    Re: Parse IMG tags

    ITistic:

    ddrudik:
    Can you guarantee that all attributes will be present in all img tags you plan to parse? Also, can you guarantee that the img tags will not have other attributes and will always use ' or " to surround values?

    No, I cannot. This regex will be used to parse images from many web pages.

    Not sure that DTD STRICT can be assumed for all target pages in this case, but the asker can confirm.


  •  06-19-2008, 4:05 PM 43340 in reply to 43330

    Re: Parse IMG tags

    killahbeez:
    The validity of that HTML depends on its DTD. Under a strict doctype it is not valid: attribute values MUST be enclosed in quotes.

    Actually, that's not true. HTML attributes don't have to be quoted unless the attribute value contains a space. ddrudik's example would only be invalid in a strict DTD because it contains a deprecated attribute (border).

    It is in XHTML that attribute values must be quoted.

    Either way, his point stands: HTML can be written in unexpected ways and still be perfectly valid markup. That is why I often recommend using the HTML DOM to parse HTML instead of a regex, especially if you are not responsible for creating the markup and so could not establish a consistent standard and style. Chances are the HTML you get won't validate anyway.
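As one possible illustration of the parser-based route, here is a minimal sketch using Python's standard `html.parser` module. That is a stand-in for whatever HTML DOM is available on your platform, not a .NET API, and the names `ImgCollector` and `extract_images` are invented for this example.

```python
from html.parser import HTMLParser

WANTED = ("src", "alt", "height", "width")


class ImgCollector(HTMLParser):
    """Collect the wanted attributes of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes.
        # Self-closing tags (<img ... />) also arrive here by default.
        if tag == "img":
            found = dict(attrs)
            self.images.append({name: found.get(name) for name in WANTED})


def extract_images(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    html = (
        '<img id="sw3-0" class="swatchQ" src="/media/x.jpg" '
        "width=14 height=14 alt=BLK title=BLK border=0>"
        "<IMG SRC='/images/image1.jpg' ALT=\"Test Image\" />"
    )
    for image in extract_images(html):
        print(image)
```

Quoting style, attribute order, letter case and extra attributes are all handled by the parser, so the calling code only deals with names and values.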


    Michael

    "In theory, theory and practice are the same. In practice, they are not."
    Albert Einstein