
identifying img tags in HTML

Last post 07-14-2009, 9:58 AM by komandin.alexander. 3 replies.
  •  07-09-2007, 4:54 PM 32770

    identifying img tags in HTML

    Hello,

     I have to match image tags in HTML. Sample image tags that will be in my HTML are:

    <img onresizestart="return false;" title="2 items addressed in Industrial Occupations" src="/std_images/1015/1015_6642.gif" border="0"  >

     Which one is more efficient?

    A) <img[\s]*(.*?)[\s]*>

    OR

    B) <img[^>]*>

    Thanks

    maximus_vj

  •  07-13-2007, 2:48 AM 32903 in reply to 32770

    Re: identifying img tags in HTML

    maximus_vj,

    In my opinion B will be the more efficient because, as the regex engine works its way through the source text, it has a 'simple' test to make at each character: is this a '>' or not? I would also prefer this pattern (as long as it does the job) because it is easier to read.

    Option A would probably (it depends on the engine you are using) make the regex engine walk through the text checking every character, saving its state as it goes, and then backtrack if and when it has gone too far. I realise that the non-greedy ".*?" will help in this situation, but I really don't know how much.
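Since the answer depends on the engine and the data, one way to settle it is to time both patterns yourself. The following is a minimal sketch in Python, assuming the `re` module as the engine; the test document `HTML` (the sample tag from the question, repeated with filler text) and the helper names `count_matches` and `time_pattern` are made up for illustration. The numbers it prints will differ between machines and engines, so treat them as a rough comparison only.

```python
import re
import timeit

# Sample tag from the original question.
TAG = ('<img onresizestart="return false;" '
       'title="2 items addressed in Industrial Occupations" '
       'src="/std_images/1015/1015_6642.gif" border="0"  >')

# Hypothetical test document: the tag plus some filler, repeated 1000 times.
HTML = ("<p>some filler text</p>\n" + TAG + "\n") * 1000

PATTERN_A = re.compile(r"<img[\s]*(.*?)[\s]*>")  # option A (lazy dot)
PATTERN_B = re.compile(r"<img[^>]*>")            # option B (negated class)


def count_matches(pattern, text=HTML):
    """Return how many matches the pattern finds in text."""
    return sum(1 for _ in pattern.finditer(text))


def time_pattern(pattern, repeats=20):
    """Return the seconds taken by `repeats` full scans of HTML."""
    return timeit.timeit(lambda: count_matches(pattern), number=repeats)


if __name__ == "__main__":
    print("A:", time_pattern(PATTERN_A))
    print("B:", time_pattern(PATTERN_B))
```

Both patterns find the same tags in this data, so any difference in the timings comes from how the engine gets there, not from what is matched.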

    If you are starting to think about things like performance, then I would suggest reading Friedl's excellent book "Mastering Regular Expressions", as it gives a lot of background on how the various regex engines work through this type of pattern, and on more and less efficient techniques.

    I made the comment earlier "as long as it does the job" because option B will stop at the first '>', which may or may not be the terminating character that matches the initial '<img'. (Actually option A will do the same, so it may have the same 'problem' if there is one.)
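To illustrate the "first '>'" point, here is a small sketch in Python (assuming the `re` module; the tag in `TRICKY` is a made-up example, not one from the original post). When an attribute value contains a literal '>', both patterns end the match there rather than at the real end of the tag.

```python
import re

PATTERN_A = re.compile(r"<img[\s]*(.*?)[\s]*>")  # option A
PATTERN_B = re.compile(r"<img[^>]*>")            # option B

# Hypothetical tag whose alt text contains a literal '>'.
TRICKY = '<img alt="a > b" src="x.gif">'

match_a = PATTERN_A.search(TRICKY).group(0)
match_b = PATTERN_B.search(TRICKY).group(0)

print(match_a)  # <img alt="a >
print(match_b)  # <img alt="a >
```

If your HTML can contain such attribute values, neither pattern is enough on its own and an HTML parser is probably the safer tool.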

    Susan 

  •  08-30-2007, 12:04 AM 34262 in reply to 32770

    Re: identifying img tags in HTML

    In singleline mode, regex A and B are completely equivalent as far as what they will match. Note that the two [\s]* sequences in regex A are meaningless, since they do not change what the regex matches. So regex A can become <img.*?>, which (in singleline mode) is also the same as <img[^>]*?>. Note that I've also removed the capturing group, since it adds nothing but performance overhead if you don't need the captured value.

    So the comparison is reduced to a greedy vs. lazy quantifier. Which will have better performance depends on the regex engine you use and the data you run the regexes over. Greedy quantifiers tend to perform better when there will not be a lot of backtracking involved. In this case, the greedy quantifier allows you to avoid backtracking entirely, since [^>]* is immediately followed by >. As a result, I'd be willing to bet that, with these exact regexes, you wouldn't be able to find a single popular regex engine where the lazy pattern performs better, except perhaps on data consisting of <img> elements without any attributes.

    Regex B is preferable, although I would recommend adjusting it slightly to <img\b[^>]*>, to offer a better chance that it will match nothing but <img> elements.
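The two claims above (equivalence in singleline mode, and the effect of adding \b) can be checked with a short sketch. This one uses Python's `re` module, where singleline mode is spelled `re.DOTALL`; the input in `HTML` is a made-up example with an img tag split across two lines and a hypothetical `<imgx>` tag.

```python
import re

# Hypothetical input: an img tag split across two lines, plus a non-img tag.
HTML = '<img src="a.gif"\n     border="0">\n<imgx foo="bar">'

# Without singleline mode, '.' does not match a newline, so the lazy-dot
# pattern cannot match the tag that spans two lines.
a_default = re.findall(r"<img.*?>", HTML)

# With singleline mode (re.DOTALL), the lazy-dot pattern matches the same
# text as the negated character class.
a_dotall = re.findall(r"<img.*?>", HTML, re.DOTALL)
b_plain = re.findall(r"<img[^>]*>", HTML)

# The \b requires a word boundary after "img", which rejects "<imgx ...>".
b_boundary = re.findall(r"<img\b[^>]*>", HTML)

print(a_default)   # ['<imgx foo="bar">']
print(a_dotall == b_plain)  # True
print(b_boundary)  # ['<img src="a.gif"\n     border="0">']
```

The negated class does not need any mode flag, because [^>] matches newlines as well, which is another reason to prefer it.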


  •  07-14-2009, 9:58 AM 55147 in reply to 32770

    Re: identifying img tags in HTML

     Stevezilla00 is clever!

    Greedy vs lazy quantifier

     

    Look at these examples:

    http://komandin.org/regex/regex_view.html
