
extracting urls from a text file

Last post 01-22-2010, 12:07 AM by freeriders. 9 replies.
  •  01-17-2010, 1:47 PM 58514

    extracting urls from a text file

    Greetings folks

    I'm a PHP rookie and I'm having a hard time with regex.

    I know my request sounds like something that has been covered a million times, but honestly I have done my homework: I have tried several patterns, I have googled, and I have tried to learn at http://regexlib.com/, but I just can't get it to work.

    Let's say I have a text file named mytext.txt

    The file has the following content:


    ******************************
    http://www.vitoriakitecenter.com.br/        to totototo  some texte here www.northcoastkiteboarding.com  http://kitesurfaustralia.com.au o totototo  some texte here www.ruakakakitesports.co.nz o totototo  some texte here <a href="http://www.airbossworld.co.uk" style='color:#000000' target='_blank'>www.airbossworld.co.uk</a>
    **************************

    I wish to put all the URLs into an array (especially those that are NOT within an <a href=" tag).

    To get the content of my text file into a string I do:


    $toto = file_get_contents('mytext.txt');

    So there is no problem getting a string with all the content of mytext.txt.

    I just can't manage to use preg_match_all to extract everything that looks like a URL, in all its possible forms.

    So I would like the following output:



    Array

    (

        [0] => http://www.vitoriakitecenter.com.br

        [1] => www.northcoastkiteboarding.com

        [2] => http://kitesurfaustralia.com.au

        [3] => www.ruakakakitesports.co.nz

        [4] => http://www.airbossworld.co.uk

    )

    I have tried several snippets out there but can't manage to get any of them to work.

    Here is the function I tried to get working,

    but nothing seems to work properly:

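        function extract_url_from($string)
        {
            // Candidate patterns, tried one at a time. Each assignment overwrites
            // the previous one, so only the last pattern is actually used.
            $pattern = '/href[^\>]*\//i';
            $pattern = '/<a[^>]*>(.*)<\/a>/';
            $pattern = '/^http\:\/\/\S+$/i';
            $pattern = '/<a[^\>]*>/i';
            preg_match_all($pattern, $string, $out, PREG_PATTERN_ORDER);

            return $out;
        }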

     

    any tips would be much much much appreciated

    thank you very much in advance for your help

    Stefany

  •  01-17-2010, 8:24 PM 58518 in reply to 58514

    Re: extracting urls from a text file

    If, as you say, you've done your homework, you will have seen "a million times" that using a regex to match a URL can never be 100% accurate. In your case, you have not told us what you have used to define a URL and I cannot tell from any of the example patterns you have included as none of them would appear to generate all of the output strings based on your test text.

    The best I could come up with is:

    (https?://)?\w+(\.\w+)+

    which generates all of the output strings you specify plus the "www.airbossworld.co.uk" that is the text between the final "<a" and "</a>" tags.

    You may need the "ignore case" option set, as well as escaping various characters depending on the regex delimiters you are using.

    I've assumed that a URL is a string that optionally begins with "http://" or "https://" and is a sequence of word characters (letters, digits and underscore, which is what \w matches) separated by periods. It cannot contain any other punctuation, so hyphenated domain names will not be matched in full. This will generally find the domain part of the complete URL but will ignore any other part of the URL such as the path or parameters. However, this would appear to suit your needs based on your example.

    Susan

  •  01-18-2010, 1:24 AM 58520 in reply to 58518

    Re: extracting urls from a text file

    Greetings Susan

     

    First of all, thank you very much for answering my question; I appreciate it very much.

    Second, yes, I do believe I have done my homework (even if I have not always understood what I have read so far), and yes, I have seen that there is no 'perfect' solution to achieve my goal. That is exactly one of the main reasons that led me to post here: asking the experts.

    What do you mean by "what you have used to define a URL"? I'm sorry, but English is not my mother tongue. I actually have a text file full of URLs in different formats (either links or plain text; some have http/https, some only www). Those URLs sit in the middle of text that I don't wish to use (let's say I consider it junk), and I wish to extract all of them. My goal is to put them into an array so that I can then feed one field of my database. If that is not what you meant by "define the URL", please let me know and I will be more than pleased to try to be clearer.

    To achieve my goal, I first load all the content of this file into a variable with $string = file_get_contents('mytext.txt');

    As you advised, I have tried

    $pattern = '(https?://)?\w+(\.\w+)+';
    $titi = preg_match_all($pattern,$string,$out);

    but this returns an error, as follows:

    Warning: preg_match_all() [function.preg-match-all]: Unknown modifier '?' in L:\localhost\www\scripts_utiles\extract_urls\extract_urls.php on line 87

    What am I doing wrong?

    Thank you very much for your help and please forgive my ignorance; I'm all ears to learn more.

     

    Steffy

    PS: you are correct, I don't need the parameters in the URL; actually I just need the domain name and the TLD.
  •  01-18-2010, 1:57 AM 58521 in reply to 58520

    Re: extracting urls from a text file

    Define URL: There are different types of URLs out there. Some examples are: C:/, file:///, https://, http://, www, ftp://, localhost, IP address (192.168.251.1)

    All of these can be considered a type of URL. Are you planning on using all of these? Susan's regex assumes that URLs start with http or https, but makes that prefix optional because some links simply start with www or the domain.

    Sorry, I don't know enough PHP to help you debug.

    One piece of advice I can give: if you are having problems making a general regex that covers all of your different URL patterns, you could instead create multiple strict regexes that each handle one pattern, and use alternation (|) to string them together into one large regex. It is not my favorite way of approaching regex (especially if you are using unnamed groups in your replacement strings), but it is always a good fallback.
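
    A minimal PHP sketch of that fallback, assuming three illustrative sub-patterns (scheme plus host, bare www host, dotted IPv4-looking address). The sub-patterns, the $parts name and the sample text are assumptions for illustration, not something posted in this thread:

```php
<?php
// Each entry is a strict sub-pattern for one URL style (illustrative, not exhaustive).
$parts = [
    'https?://[a-z\d-]+(?:\.[a-z\d-]+)+',   // scheme followed by a dotted host
    'www\.[a-z\d-]+(?:\.[a-z\d-]+)+',       // bare host starting with "www."
    '(?:\d{1,3}\.){3}\d{1,3}',              // four dot-separated groups of 1-3 digits
];

// Join the alternatives with "|" inside a non-capturing group; "~" is the delimiter.
$pattern = '~(?:' . implode('|', $parts) . ')~i';

$text = 'see http://kitesurfaustralia.com.au or www.ruakakakitesports.co.nz at 192.168.251.1 today';
preg_match_all($pattern, $text, $out);

print_r($out[0]);
```

    Keeping the alternatives in an array makes it easy to add or drop one style at a time while testing. Using non-capturing groups avoids the group-numbering problem mentioned above.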

  •  01-18-2010, 3:13 AM 58524 in reply to 58520

    Re: extracting urls from a text file

    You don't have delimiters in your pattern. Try

     $pattern = '{(https?://)?\w+(\.\w+)+}';
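
    Some background on the warning, for anyone hitting the same thing: PHP's preg_* functions treat the first character of the pattern string as the delimiter. Without explicit delimiters, the opening "(" is taken as the delimiter, its matching ")" ends the pattern, and the "?" that follows is read as a modifier, hence "Unknown modifier '?'". A small sketch with a shortened sample string (the string is an assumption, not the original file):

```php
<?php
// Hypothetical sample text, shortened from the question.
$string = 'http://www.vitoriakitecenter.com.br/ some text www.northcoastkiteboarding.com';

// The braces are the delimiters; the trailing "i" makes the match case-insensitive.
$pattern = '{(https?://)?\w+(\.\w+)+}i';

// preg_match_all returns the number of full matches and fills $out.
$count = preg_match_all($pattern, $string, $out);

print_r($out[0]);
```

    Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter. Braces or "~" are convenient here because the pattern itself contains "/".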


    http://portal-vreme.ro
  •  01-18-2010, 3:16 AM 58525 in reply to 58524

    Re: extracting urls from a text file

    (?xi)
    (?![^<>]*+>)(https?://)?
    (  
        (?# expressed as a DOMAIN and TLD)
        (?:
            (?#DOMAIN NAME, from rfc 3696 expected [alpha | digits | hyphen], but not starting or ending with hyphen )
            (?:(?!-)[a-z\d-]+(?<!-)\.)+     

            (?#TLD [alpha | digits | hyphen], but not starting or ending with hyphen AND can not contain only digits)
            ((?!-)(?!\d+\b) [a-z\d-]+(?<!-))
         ) 

    | (?# OR)

        (?# expressed as IP address)
        (?:   
            (?:\d{1,3}\.){3}\d{1,3}
        )
    )

    (?# optional port number)
    (?::(\d+))?

    (?# optional path [request_uri])
    (/
        (?:
            (?:[a-z\d_.!~*\'():@&=+$,-]|%[\da-f]{2})+
            /?
         )+
    )?

    (?# optional query_string)
    (?:
         \?
         ((?:[a-z\d;/?:@&=+$,_.!~*\'()-]|%[\da-f]{2})+)
    )?
    (?# #fragment)
    (?:
          \x23
          ((?:[a-z\d;/?:@&=+$,_.!~*\'()-]|%[\da-f]{2})*)
    )?

    http://portal-vreme.ro
  •  01-18-2010, 4:58 PM 58561 in reply to 58520

    Re: extracting urls from a text file

    Steffy,

    I think the others have answered your questions (the source of the error and my comment about "your definition of a URL").

    If not, then get back to us.

    Susan

  •  01-19-2010, 3:52 AM 58582 in reply to 58561

    Re: extracting urls from a text file

    Dear Susan,Anglaissam, Killahbeez

    First of all, THANK YOU very, very much for all your help, time and support. You all really made my day.

    Killahbeez, $pattern = '{(https?://)?\w+(\.\w+)+}'; works like a charm!! It returns a multidimensional array:

    Array
    (
    [0] => Array
    (
    [0] => http://www.vitoriakitecenter.com.br
    [1] => northcoastkiteboarding.com
    [2] => http://www.kitesurfaustralia.com.au
    [3] => www.ruakakakitesports.co.nz
    [4] => http://www.airbossworld.co.uk
    [5] => www.airbossworld.co.uk
    [6] => www.newlink.com
    )

    [1] => Array
    (
    [0] => http://
    [1] =>
    [2] => http://
    [3] =>
    [4] => http://
    [5] =>
    [6] =>
    )

    [2] => Array
    (
    [0] => .br
    [1] => .com
    [2] => .au
    [3] => .nz
    [4] => .uk
    [5] => .uk
    [6] => .com
    )

    )

    EXACTLY what I wanted. I can then just work with the [0] element. Thank you, thank you, and thank you again.

    As I told you, I'm willing to learn, so reaching my goal (thanks to your pattern, killahbeez) does not mean I'm done.

    So I have a hard time understanding what is in post 58525 (killahbeez's second answer).

    What language is it? Is it meant to be used as a case statement?
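
    A hedged sketch of working with just the [0] element: it takes the full matches, strips an optional scheme, and removes duplicates such as a link whose href and visible text are the same host. The input string and the $hosts name are assumptions for illustration:

```php
<?php
// Hypothetical input with a repeated host, loosely based on the sample text.
$string = '<a href="http://www.airbossworld.co.uk">www.airbossworld.co.uk</a> www.newlink.com';

preg_match_all('{(https?://)?\w+(\.\w+)+}i', $string, $out);

// $out[0] holds the full matches; $out[1] and $out[2] hold the capture groups.
$hosts = [];
foreach ($out[0] as $url) {
    // Strip an optional scheme so "http://x" and "x" compare equal, then lowercase.
    $hosts[] = strtolower(preg_replace('{^https?://}i', '', $url));
}

// array_unique keeps the first occurrence; array_values reindexes from 0.
$hosts = array_values(array_unique($hosts));

print_r($hosts);
```

    The resulting flat array is easier to loop over when inserting rows into a database than the nested structure preg_match_all returns.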

     Thank you again all

     best regards

     

    Steffy

     

     

     

  •  01-19-2010, 5:32 PM 58622 in reply to 58582

    Re: extracting urls from a text file

    freeriders:

    So I have a hard time understanding what is in post 58525 (killahbeez's second answer).

    It's actually a regex pattern for all of the various parts of a URL. It includes comments (those are the groups that look like '(?# comment text here)') and has been split over multiple lines and indented to show the overall structure of the various parts. (The '(?xi)' right at the very beginning sets the "i" or "ignore case" option and the "x" or "extended" option. The latter tells some regex interpreters to ignore whitespace in the pattern, including line breaks. This is a useful way of laying out a complex pattern so that it can be understood in parts.)

    I think he was making the point that using a regex pattern to locate and identify a URL is a very complex process and even the one he presented is not 100% accurate (none are - there are some rather strange aspects that make up a URL that conforms to the "standard").
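
    To make the layout idea concrete, here is a much simplified pattern (deliberately not the full one from post 58525) written the same way and used from PHP. The pattern and the sample sentence are assumptions for illustration only:

```php
<?php
// Simplified pattern in extended mode. The "x" modifier ignores unescaped
// whitespace in the pattern, and "(?# ...)" groups are comments.
$pattern = '~
    (https?://)?          (?# optional scheme)
    (?:[a-z\d-]+\.)+      (?# one or more labels, each followed by a dot)
    [a-z]{2,}             (?# alphabetic TLD of at least two letters)
~xi';

$text = 'go to http://kitesurfaustralia.com.au or www.ruakakakitesports.co.nz';
preg_match_all($pattern, $text, $out);

print_r($out[0]);
```

    In PHP the options can be given either as trailing modifiers, as here, or inline as (?xi) at the start of the pattern.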

    Susan

  •  01-22-2010, 12:07 AM 58710 in reply to 58622

    Re: extracting urls from a text file

    Thank you Susan, it is clear to me now. I have tried the first regex he sent me; it worked well, but some things like " 123.45 " were also collected. With the additional information it looks clear that I will have to use several regexes, one after the other, to reach my goal.
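
    One possible shape for that two-step approach, as a rough sketch: collect candidates with the loose pattern, then keep only those whose last label is alphabetic. The sample string and the second pattern are assumptions, and this filter would also discard plain IP addresses:

```php
<?php
// Hypothetical input containing a decimal number that the loose pattern also picks up.
$string = 'price 123.45 at www.northcoastkiteboarding.com or http://kitesurfaustralia.com.au';

// Pass 1: the loose pattern from earlier in the thread.
preg_match_all('{(https?://)?\w+(\.\w+)+}i', $string, $out);

// Pass 2: keep only candidates ending in a dot plus at least two letters,
// which drops things like "123.45". preg_grep preserves the original keys.
$urls = array_values(preg_grep('~\.[a-z]{2,}$~i', $out[0]));

print_r($urls);
```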

    Thank you again to all, this gave me a good start

     

    Steffy
