
extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

Last post 07-23-2008, 11:17 PM by newcyberian. 4 replies.
  •  07-23-2008, 12:19 AM 44474

    extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

    Can someone build a regular expression to extract the URL from a meta redirect tag?

     For example,

    <meta http-equiv="refresh" content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx">

    Note that the http-equiv and content attributes may appear in either order.  I think the following is also a valid HTML tag.

    <meta content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx" http-equiv="refresh">

    Single quotes, double quotes, or (for the refresh value) no quotes at all may be used.  I need the regex to get the URL, i.e. http://www.microsoft.com/presspass/exec/billg/default.mspx, in the above cases.

     Thanks!


    CD / DVD Duplication
    -----------------------------
    www.newcyberian.com
  •  07-23-2008, 8:05 PM 44503 in reply to 44474

    Re: extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

    This is very similar to the question in http://regexadvice.com/forums/thread/43567.aspx and the responses there may well apply here.

    Based on that, I can think of two approaches, depending on the context of the text and what is really 'valid'. Firstly, you can extract the value of the 'content' attribute and you can then process it separately. For this the pattern:

    <meta(\s*(content=(['"])(((?!\3).)*)\3|[\w-]+(\s*=\s*\S*)?))+\s*/?>

    will give you the value in match group #4.
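
    As a rough illustration of this first approach, here is a minimal Python sketch. The pattern is copied from the post above; the helper names content_value and url_from_content are made up for this example, and the second step (splitting on the semi-colon) is only one assumed way of doing the "process it separately" part.

```python
import re

# First pattern from the post, unchanged; group 4 is the content value.
CONTENT_ATTR = re.compile(
    r"""<meta(\s*(content=(['"])(((?!\3).)*)\3|[\w-]+(\s*=\s*\S*)?))+\s*/?>""",
    re.IGNORECASE,
)


def content_value(html):
    """Return the raw value of the content attribute, or None if not found."""
    m = CONTENT_ATTR.search(html)
    return m.group(4) if m else None


def url_from_content(value):
    """Take '0;URL=http://...' and return the text after 'url=' (hypothetical helper)."""
    _delay, _sep, rest = value.partition(";")
    rest = rest.strip()
    return rest[4:] if rest.lower().startswith("url=") else None


tag = ('<meta content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx"'
       ' http-equiv="refresh">')
print(url_from_content(content_value(tag)))
```

    Because the pattern walks over every attribute, it does not matter whether content comes before or after http-equiv.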

    Second, if the 'content=' attribute is always followed by any text followed by a semi-colon, and the value you want extends from after the semi-colon to the end of the quoted string, then:

    <meta((?!content=).)*content=(["'])[^;]*;url=(((?!\2).)*)\2

    will give you just the URL text value in match group #3. In this case, the search will stop when the 'content=' attribute is found, whereas the first option will find ALL attribute/value pairs and extract the value from the one you want.
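
    A small Python sketch of this second pattern, assuming the whole tag sits on one line (the '.' in the pattern does not cross newlines unless re.DOTALL is added). The function name refresh_url is invented for the example.

```python
import re

# Second pattern from the post, unchanged; group 3 is the URL.
REFRESH_URL = re.compile(
    r"""<meta((?!content=).)*content=(["'])[^;]*;url=(((?!\2).)*)\2""",
    re.IGNORECASE,
)


def refresh_url(html):
    """Return the URL from a quoted content='N;url=...' attribute, or None."""
    m = REFRESH_URL.search(html)
    return m.group(3) if m else None


for tag in (
    '<meta http-equiv="refresh" content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx">',
    "<meta content='0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx' http-equiv='refresh'>",
):
    print(refresh_url(tag))
```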

    By the way, I've used the 'ignore case' option for both of these patterns.

    You can simplify these patterns a bit if you can accept a quoted string that might start with a single-quote and end with a double-quote (or vice-versa). The second pattern then becomes:

    <meta((?!content=).)*content=["'][^;]*;url=([^'"]*)["']

    and the text you want will be in match group #2.
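
    To make the trade-off concrete, here is a hedged Python sketch of the simplified pattern. The example.com URLs are invented test inputs, and relaxed_url is a hypothetical helper name. Besides accepting mismatched quotes, the character class [^'"] also stops at the first quote of either kind, so a URL containing an apostrophe is cut short, whereas the back-reference version would keep it whole.

```python
import re

# Simplified pattern from the post, unchanged; group 2 is the URL.
RELAXED = re.compile(
    r"""<meta((?!content=).)*content=["'][^;]*;url=([^'"]*)["']""",
    re.IGNORECASE,
)


def relaxed_url(html):
    """Return group 2 of the simplified pattern, or None if there is no match."""
    m = RELAXED.search(html)
    return m.group(2) if m else None


# Opening double-quote, closing single-quote: still accepted.
print(relaxed_url("""<meta http-equiv="refresh" content="0;url=http://example.com/'>"""))

# Apostrophe inside the URL: the capture ends at the apostrophe.
print(relaxed_url("""<meta http-equiv="refresh" content="0;url=http://example.com/o'brien">"""))
```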

    Susan 

  •  07-23-2008, 9:30 PM 44506 in reply to 44474

    Re: extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

    (?i)<meta\b[^>]*URL=([^"']*)

    This should cover a quoted URL; group 1 is your target.  It works on the assumption that no HTML META tag other than a redirect tag uses URL=, although I could be proven wrong.
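
    For anyone trying this in Python, a minimal sketch follows; meta_url is a made-up helper name. The inline (?i) has to stay at the very start of the pattern, which it does here.

```python
import re

# Pattern from the post, unchanged; group 1 is the URL.
META_URL = re.compile(r"""(?i)<meta\b[^>]*URL=([^"']*)""")


def meta_url(html):
    """Return the text after URL= up to the next quote, or None."""
    m = META_URL.search(html)
    return m.group(1) if m else None


print(meta_url('<meta http-equiv="refresh" content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx">'))
print(meta_url('<meta content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx" http-equiv="refresh">'))
```

    One thing to watch: the capture only stops at a quote character, so if the content value is not quoted the match would run on past the end of the tag and need trimming.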


  •  07-23-2008, 11:09 PM 44509 in reply to 44474

    Re: extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

    Since no one had answered my post, I tried it myself, and this seemed to work:

     <meta\s+(http-equiv|name|content)=['\"]?([^'\"]+)['\"]?\s+(http-equiv|name|content)=['\"]?([^'\"]*)['\"]?\s*/?>

    Test \1 to see if it's "content"; if so, \2 holds the content value.  Otherwise test \3 to see if it's "content"; if so, \4 holds it.  Of course, you still need to parse the URL out of \2 or \4.
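
    The group test described above might look like this in Python. This is only a sketch: the \" escapes are written as plain " because the pattern lives in a raw string, the ignore-case flag is an added assumption, and URL_PART and refresh_url are names invented here for the final "parse out the url" step.

```python
import re

# Two-attribute pattern from the post (with \" written as ").
TWO_ATTRS = re.compile(
    r"""<meta\s+(http-equiv|name|content)=['"]?([^'"]+)['"]?\s+(http-equiv|name|content)=['"]?([^'"]*)['"]?\s*/?>""",
    re.IGNORECASE,
)

# Hypothetical second step: pull the URL out of '0;URL=http://...'.
URL_PART = re.compile(r"^[^;]*;\s*url=(.*)$", re.IGNORECASE)


def refresh_url(html):
    """Pick the content value from group 2 or group 4, then extract the URL."""
    m = TWO_ATTRS.search(html)
    if not m:
        return None
    if m.group(1).lower() == "content":
        value = m.group(2)
    elif m.group(3).lower() == "content":
        value = m.group(4)
    else:
        return None
    u = URL_PART.match(value)
    return u.group(1) if u else None


print(refresh_url('<meta http-equiv="refresh" content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx">'))
print(refresh_url('<meta content="0;URL=http://www.microsoft.com/presspass/exec/billg/default.mspx" http-equiv="refresh">'))
```

    Note that this pattern expects exactly two attributes, so a tag carrying a third one would not match at all.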

     


  •  07-23-2008, 11:17 PM 44510 in reply to 44509

    Re: extract url from the <meta http-equiv="refresh" content="0;url=http://.....">

    Sorry to have jumped to a conclusion without checking.  Susan's approaches work too.
