
PHP scrape and regex problems ... how to exclude?

Last post 12-01-2008, 2:47 PM by Tomas Johansson. 18 replies.
  •  11-28-2008, 6:29 PM 48936

    PHP scrape and regex problems ... how to exclude?

    Got a problem ... I've got a php-script that scrapes a schedule system for a school ... The output/feed will be displayed as a ticker (Jticker) at the bottom of information monitors around the buildings. The scrape works just fine and the feed as well, I've even managed to modify it so that it passes with no-errors at feedvalidator.org ...

    This is the html part that I'm after: 

    <TR>
    <TD style='border-left:1px solid #888888; border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>09:15-15:00</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>SJK52</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>B126</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>StObe</font></TD>

    </TR>

    What I want is to only scrape the first three <TD> and skip the fourth ... no success.

    This is the PHP-source:

    <?php

    // SCRAPE
    $url = "http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-";
    $data = implode("", file($url)); 

    // ALL ITEMS
    preg_match_all ("/<font class=\'a_text\'>([^`]*?)<\/TR>/", $data, $matches);

    // START FEED
    header ("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:atom="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
            <title>Lektioner:</title>
            <description>Campus Varberg</description>
            <link>http://landris.hh.se</link>
            <atom:link href="http://www.campus.varberg.se/dev/scrape/schema.php" rel="self" type="application/rss+xml" />
            <language>se-SE</language>


    <?
    // SINGLE ITEM
    foreach ($matches[0] as $match) {

        // TITLE
        preg_match ("/<font class=\'a_text\'>([^`]*?)<\/TR>/", $match, $temp);
        $title = $temp['1'];
        $title = strip_tags($title);
        $title = str_replace('&nbsp;', '', $title);
        $title = trim($title);

        // DESCRIPTION
        preg_match ("/<font class=\'a_text\'>([^`]*?)<\/TR>/", $match, $temp);
        $text = $temp['1'];
        $text = strip_tags($text);
        $text = trim($text);
        $text = str_replace('&nbsp;', '', $text);

        // GUID
        $token = md5 (uniqid ());

        // RSS XML
        echo "<item>\n";
            echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
            echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
            echo "\t\t\t<content:encoded><![CDATA[ \n";
            echo $text . "\n";
            echo " ]]></content:encoded>\n";
            echo "<guid>http://www.campus.varberg.se/$token.html</guid>\n";
        echo "\t\t</item>\n";
    }
    ?>

    </channel>

    </rss>

     

    I'm trying to use regex to skip the fourth <TD> ... is it possible?

    If I use: 

    preg_match_all ("/font class=\'a_text\'>([^`]*?)<\/TR>", $data, $matches);

     

    I get the whole <TR>.

    I've tried to add {3} inside the regex like this:

     

    preg_match_all ("/font class=\'a_text\'>([^`{3}]*?)<\/TR>", $data, $matches); 

     

    But it really doesn't do the trick; if I put {1} instead of {3}, I ONLY get the fourth <TD>.

    Live urls: 

    Site: http://www.campus.varberg.se/dev/
    Feed: http://www.campus.varberg.se/dev/scrape/schema.php

    Any help is appreciated (maybe I'll give you a credit on the templates for the monitors, so if you come to Varberg, Sweden, you'd see your name at Campus Varberg ;) ) ... Tomas

  •  11-29-2008, 3:06 AM 48945 in reply to 48936

    Re: PHP scrape and regex problems ... how to exclude?

    Tomas Johansson:

    Got a problem ... I've got a php-script that scrapes a schedule system for a school ... The output/feed will be displayed as a ticker (Jticker) at the bottom of information monitors around the buildings. The scrape works just fine and the feed as well, I've even managed to modify it so that it passes with no-errors at feedvalidator.org ...

    This is the html part that I'm after: 

    <TR>
    <TD style='border-left:1px solid #888888; border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>09:15-15:00</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>SJK52</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>B126</font></TD>
    <TD style='border-bottom:1px solid #888888; border-right:1px solid #888888; ' nowrap><font class='a_text'>StObe</font></TD>

    </TR>

    What I want is to only scrape the first three <TD> and skip the fourth ... no success.
    ...

    Hi, I didn't read the entire script you posted, but to match only the first three TD's inside a TR tag, you could do  something like this:

    if(preg_match_all('#<td\s+(?:(?!</td>).)*</td>(?!\s*</tr>)#is', $text, $matches)) {
      // ...
    }
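
    As a quick check, here is a minimal sketch that runs this pattern against the sample row from the question. The $row string is a shortened copy of that markup (the style attributes are trimmed to placeholders), and the variable names are only for illustration.

```php
<?php
// Shortened copy of the sample row from the question (style values are placeholders).
$row = "<TR>\n"
     . "<TD style='a' nowrap><font class='a_text'>09:15-15:00</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>SJK52</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>B126</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>StObe</font></TD>\n"
     . "\n</TR>";

// Match every TD that is not directly followed by the closing TR tag.
$pattern = '#<td\s+(?:(?!</td>).)*</td>(?!\s*</tr>)#is';
preg_match_all($pattern, $row, $matches);

// Strip the markup so only the cell text is left.
$cells = array_map('strip_tags', $matches[0]);
print_r($cells);
```

    The negative lookahead at the end rejects only the TD that sits right before </TR>, so this approach relies on the unwanted cell always being the last one in its row.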

  •  11-29-2008, 3:25 AM 48946 in reply to 48936

    Re: PHP scrape and regex problems ... how to exclude?

    Will there always be exactly 4 TD's in your source?  If there might be more than 4 is it always the last one you need to skip?

    In your code, your target just seems to be the data between each <font class='a_text'> and </font>. Is that your target?

    Please explain your goal further so that your code can be simplified. It seems like you are capturing unwanted data that you then strip out later with string functions.


  •  11-29-2008, 3:45 AM 48947 in reply to 48946

    Re: PHP scrape and regex problems ... how to exclude?

    Well, there will always be four TD's with the class 'a_text' (no more or less), but the source contains a lot more TD's and TR's; the problem is that they don't have any class or ID applied. And yes, I'm only interested in the data between <font class='a_text'> and </font>, except for the one in the fourth TD ... which I am trying to exclude.

    I tried prometheuzz's suggestion and it kind of worked, but it also included some of the TD's I don't want.

    This is the link to the source I want to scrape:

    http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-

     

  •  11-29-2008, 4:13 AM 48948 in reply to 48947

    Re: PHP scrape and regex problems ... how to exclude?

    How about this:

    $text = file_get_contents('http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-');
    $regex = "#<td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>(?!\s*</tr>)#isx";
    if(preg_match_all($regex, $text, $matches)) {
      print_r($matches);
    }

    Or like this where the $matches are perhaps easier to handle:

    $text = file_get_contents('http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-');
    $regex = "#
        <tr>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>
    #isx";
    if(preg_match_all($regex, $text, $matches)) {
      print_r($matches);
    }
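
    To see how the three capture groups line up without fetching the live page, here is a small sketch that applies the second pattern to a shortened copy of the sample row. The $row string and its placeholder style values are assumptions for the demo, and PREG_SET_ORDER is an addition of this sketch: it groups the results per row instead of per column.

```php
<?php
// Shortened copy of the sample row from the question (style values are placeholders).
$row = "<TR>\n"
     . "<TD style='a' nowrap><font class='a_text'>09:15-15:00</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>SJK52</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>B126</font></TD>\n"
     . "<TD style='b' nowrap><font class='a_text'>StObe</font></TD>\n"
     . "\n</TR>";

// Same pattern as above: three TDs after <tr>, one capture group per cell.
$regex = "#
    <tr>\s*
    <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
    <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
    <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>
#isx";

// With PREG_SET_ORDER each element of $rows holds one table row.
preg_match_all($regex, $row, $rows, PREG_SET_ORDER);

foreach ($rows as $r) {
    // $r[1], $r[2] and $r[3] are the first three cells; the fourth TD is never captured.
    echo $r[1] . ' | ' . $r[2] . ' | ' . $r[3] . "\n";
}
```

    Because the pattern stops after the third </td>, the fourth cell is simply left out of the match instead of having to be filtered afterwards.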

  •  11-29-2008, 6:22 AM 48953 in reply to 48948

    Re: PHP scrape and regex problems ... how to exclude?

    Have to confess that I'm quite a newbie at regex and PHP as well ... more of an open-source leech :) Normally I manage to adjust the code so that it suits my needs, but this time I'm up against the wall ;) Thanks for all your help so far ...

    This is how I implemented your last code snippet, but it only gives me the data from the first TD, not the second and third. The first regex you posted kind of worked (it stripped away the fourth TD), but it also gave me a lot of empty 'items' in my feed ... probably the ones that were stripped away:

     

    <?php

    // Get page
    $url = "http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-";
    $data = implode("", file($url));

    // Get content items
    preg_match_all ("#<TR>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>#isx", $data, $matches);

    // Begin feed
    header ("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:atom="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
            <title>Lektioner:</title>
            <description>Campus Varberg</description>
            <link>http://landris.hh.se</link>
            <atom:link href="http://www.campus.varberg.se/dev/scrape/schema.php" rel="self" type="application/rss+xml" />
            <language>se-SE</language>

    <?
    // Loop through each content item
    foreach ($matches[0] as $match) {

        // First, get title
        preg_match ("#<TR>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>\s*<TD\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</TD>#isx", $match, $temp);
        $title = $temp['1'];
        $title = strip_tags($title);
        $title = str_replace('&nbsp;', '', $title);
        $title = trim($title);

        // Third, get text
        preg_match ("/<font class=\'a_text\'>([^`]*?)<\/font>/", $match, $temp);
        $text = $temp['1'];
        $text = strip_tags($text);
        $text = trim($text);
        $text = str_replace('&nbsp;', '', $text);

        $token = md5 (uniqid ());

        // Echo RSS XML
        echo "<item>\n";
            echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
            echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
            echo "\t\t\t<content:encoded><![CDATA[ \n";
            echo $text . "\n";
            echo " ]]></content:encoded>\n";
            echo "<guid>http://www.campus.varberg.se/$token.html</guid>\n";
        echo "\t\t</item>\n";
    }
    ?>

    </channel>
    </rss>

  •  11-29-2008, 6:33 AM 48954 in reply to 48953

    Re: PHP scrape and regex problems ... how to exclude?

    Tomas Johansson:

    Have to confess that I'm quite a newbie at regex and PHP as well ... more of an open-source leech :) Normally I manage to adjust the code so that it suits my needs, but this time I'm up against the wall ;) Thanks for all your help so far ...

    This is how I implemented your last code snippet, but it only gives me the data from the first TD, not the second and third. The first regex you posted kind of worked (it stripped away the fourth TD), but it also gave me a lot of empty 'items' in my feed ... probably the ones that were stripped away:

     ...

    And the second? It prints all the three things you're interested in in three separate arrays. When I execute my second example, I get the following output:

        [1] => Array
            (
                [0] => 08:15-14:00
                [1] => 09:00-12:00
                [2] => 09:00-12:00
                [3] => 09:00-12:00
                [4] => 09:00-13:30
                [5] => 09:00-17:00
                [6] => 09:15-12:00
                [7] => 09:15-15:00
                [8] => 09:30-12:30
                [9] => 10:00-12:30
                [10] => 10:00-14:00
                [11] => 10:00-15:00
                [12] => 12:15-13:30
                [13] => 13:00-15:00
                [14] => 13:00-16:00
                [15] => 13:00-17:00
                [16] => 14:30-16:45
                [17] => 16:00-18:00
                [18] => 17:00-20:00
                [19] => 17:00-20:30
                [20] => 18:00-20:30
                [21] => 08:00-16:00
                [22] => 09:00-12:00
                [23] => 09:00-12:00
                [24] => 09:00-15:00
                [25] => 09:00-16:00
                [26] => 09:15-12:00
                [27] => 09:15-12:00
                [28] => 09:15-12:00
                [29] => 09:15-12:30
                [30] => 09:15-15:00
                [31] => 09:15-16:00
                [32] => 09:30-12:30
                [33] => 09:30-12:30
                [34] => 10:00-14:00
                [35] => 10:15-12:00
                [36] => 10:15-14:30
            )

        [2] => Array
            (
                [0] => NTB08
                [1] => EVM0708
                [2] => LUF05
                [3] => PDM08
                [4] => LUF08
                [5] => SOP08
                [6] => EPP08PMT08
                [7] => SJK52
                [8] => GAS07
                [9] => MAN08
                [10] => �r
                [11] => EVM08
                [12] => MUS06
                [13] => MAN08
                [14] => PDM08
                [15] => GAS07
                [16] => LUF07
                [17] => H�08
                [18] => PEK08
                [19] => PRJ08
                [20] => J�08
                [21] => EVI08
                [22] => EVM0708
                [23] => PMT07
                [24] => LUF06
                [25] => PDM08
                [26] => SJK48
                [27] => BCP08
                [28] => EPP08PMT08
                [29] => LUF08
                [30] => SJK46
                [31] => SYV08
                [32] => MAN07
                [33] => GAS07
                [34] => MAN06
                [35] => BCP07
                [36] => NTB08
            )

        [3] => Array
            (
                [0] => B113, B114
                [1] => C303
                [2] => A148
                [3] => E308
                [4] => C302
                [5] => B127
                [6] => E307
                [7] => B126
                [8] => B217
                [9] => B130
                [10] => B125
                [11] => E218
                [12] => A210
                [13] => A148
                [14] => A146
                [15] => B217
                [16] => B128
                [17] => B130
                [18] => A210
                [19] => C302
                [20] => A209
                [21] => B127
                [22] => C303
                [23] => A148
                [24] => E410
                [25] => E313
                [26] => A342
                [27] => C302
                [28] => E307
                [29] => B128
                [30] => B312
                [31] => E116
                [32] => A208
                [33] => B217
                [34] => B125
                [35] => A209
                [36] => B113, B114
            )

    Could you tell me what is NOT matched by this? Or if the opposite is true, could you indicate what text is matched that should not have been matched?

  •  11-29-2008, 6:49 AM 48955 in reply to 48953

    Re: PHP scrape and regex problems ... how to exclude?

    The same regex, but a different presentation:

    <?php
    $text = file_get_contents('http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-');
    $regex = "#
        <tr>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>
    #isx";
    if(preg_match_all($regex, $text, $matches)) {
      for($j = 0; $j < sizeof($matches[0]); $j++) {
        for($i = 1; $i <= 3; $i++) {
          echo  $matches[$i][$j] . "<br />";
        }
        echo "------------------<br />";
      }
    }
    ?>

  •  11-29-2008, 10:37 AM 48957 in reply to 48955

    Re: PHP scrape and regex problems ... how to exclude?

    Possibly seeing the output in a format that you are familiar with may help as well:

    <pre>
    <?php
    $text = file_get_contents('http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-');
    $regex = "#
        <tr>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>
    #isx";
    if (preg_match_all($regex, $text, $matches)) {
        for ($i = 0; $i < count($matches[0]); $i++) {
            echo "title:" . $matches[1][$i] . "\n";
            echo "data:" . $matches[2][$i] . "\n";
            echo "data:" . $matches[3][$i] . "\n";
            echo "<hr>";
        }
    } else {
        echo "err...something went wrong.";
    }
    ?>

    If you would like to see a specific example that references your RSS code, please show what output you would like. I don't see the second data group (capture group 3) referenced in your code.


  •  11-29-2008, 11:48 AM 48959 in reply to 48955

    Re: PHP scrape and regex problems ... how to exclude?

    Well, as I said before, I want to use the output as a feed. There are two options:

    1. The first three TD's output together as one item, used as the title.

    2. Each TD output as a single item.

    The three TD's I'm trying to scrape represent:

    1. Time, e.g. (08:30-12:00)

    2. Course, e.g. (PDM06)

    3. Lecture hall number, e.g. (E116)

    The TD I'm trying to exclude contains an internal code which has no meaning for the students ... it only confuses them.

    The second $match I use in my script is only there to give the feed a "description"; the component that I use as a "feed reader" won't validate the feed otherwise. The second $match could just be a "mirror" of the first $match, since the students would only see the feed's item titles in the "ticker".

     

    The final feed should look something like this:

    <item>
    <title>08:30-12:00 PDM06 E116</title>
    <description>08:30-12:00 PDM06 E116</description>
    <guid>http://www.campus.varberg.se/xxxxxxxx.html</guid>
    </item>

    OR in case two ... which I actually would prefer:

    <item>
    <title>08:30-12:00</title>
    <description>08:30-12:00</description>
    <guid>http://www.campus.varberg.se/xxxxxxxx.html</guid>
    </item>

    <item>
    <title>PDM06</title>
    <description>PDM06</description>
    <guid>http://www.campus.varberg.se/xxxxxxxx.html</guid>
    </item>

    <item>
    <title>E116</title>
    <description>E116</description>
    <guid>http://www.campus.varberg.se/xxxxxxxx.html</guid>
    </item>

     

    As I said in my previous post, I'm quite a newbie, so I can't really figure out how to implement your script in mine so that the arrays are output as an XML-compatible feed.

    Maybe I'm asking too much from you guys, and I'm really glad for all your help so far.

  •  11-29-2008, 1:16 PM 48961 in reply to 48959

    Re: PHP scrape and regex problems ... how to exclude?

    See if this outputs as you need:

    <?php
    $text = file_get_contents('http://landris.hh.se/4DACTION/WebShowRoll/1-21?offset=4320&update=0&rows=0&page=0&branch=4&group=-21&start=yes&stop=yes&order=ascending&web_cols=1&web_numChars=-');
    $regex = "#
        <tr>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>\s*
        <td\s+style=[^>]+>\s*<font\s*class='a_text'>\s*([^>]*)</font>\s*</td>
    #isx";
    if (preg_match_all($regex, $text, $matches)) {
        for ($i = 0; $i < count($matches[0]); $i++) {
            for ($j = 1; $j < 4; $j++) {
                echo "<item>\n\t<title>".$matches[$j][$i]."</title>\n\t<description>".$matches[$j][$i]."</description>\n\t<guid>http://www.campus.varberg.se/".md5(uniqid()).".html</guid>\n</item>\n\n";
            }
        }
    } else {
        echo "err...something went wrong.";
    }
    ?>

    output snippet:

    <item>
        <title>08:15-14:00</title>
        <description>08:15-14:00</description>
        <guid>http://www.campus.varberg.se/ea260b082cfd5ecd59332cbad0a0d8c8.html</guid>
    </item>

    <item>
        <title>NTB08</title>
        <description>NTB08</description>
        <guid>http://www.campus.varberg.se/1ac36b595f8be362a1db4d5ef49f9381.html</guid>
    </item>

    <item>
        <title>B113, B114</title>
        <description>B113, B114</description>
        <guid>http://www.campus.varberg.se/b849f327aa7fe9e15cc3b9c0121ccec4.html</guid>
    </item>

    <item>
        <title>09:00-12:00</title>
        <description>09:00-12:00</description>
        <guid>http://www.campus.varberg.se/81f06aa2255acc85690e12b77e4feb96.html</guid>
    </item>

    <item>
        <title>EVM0708</title>
        <description>EVM0708</description>
        <guid>http://www.campus.varberg.se/4411a1a929b3bfeb5771e144688df001.html</guid>
    </item>
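
    For the other option mentioned earlier (the three cells joined into one title per lesson), a variant could look like the sketch below. The helper name make_item and the $rows sample array are made up for this example. The call to htmlspecialchars() is also an addition: it keeps characters such as & in the scraped text from breaking the XML, and it is told to use ISO-8859-1 because that is the charset the feed declares.

```php
<?php
// Hypothetical helper: builds one <item> from the three scraped cells of a row.
function make_item(array $cells): string
{
    // Join time, course and room with spaces, then escape the result for XML.
    $title = htmlspecialchars(implode(' ', array_map('trim', $cells)), ENT_QUOTES, 'ISO-8859-1');
    $guid  = 'http://www.campus.varberg.se/' . md5(uniqid()) . '.html';

    return "<item>\n"
         . "\t<title>$title</title>\n"
         . "\t<description>$title</description>\n"
         . "\t<guid>$guid</guid>\n"
         . "</item>\n";
}

// Sample data in the shape of one matched row (values taken from this thread).
$rows = [['08:30-12:00', 'PDM06', 'E116']];
foreach ($rows as $cells) {
    echo make_item($cells);
}
```

    With the column-ordered $matches from the script above, one row's cells would be array($matches[1][$i], $matches[2][$i], $matches[3][$i]).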


  •  11-29-2008, 2:57 PM 48964 in reply to 48961

    Re: PHP scrape and regex problems ... how to exclude?

    Great, ddrudik! Your snippet looks exactly like what I'm trying to achieve. One question though: how do I get the results to display as a valid RSS feed? Normally I add the $matches into an output like the one below; I need to make it a valid RSS/Atom feed for it to work properly.

     

    // START FEED
    header ("Content-Type: text/xml; charset=ISO-8859-1");
    echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    ?>
    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:atom="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <channel>
            <title>Lektioner:</title>
            <description>Campus Varberg</description>
            <link>http://landris.hh.se</link>
            <atom:link href="http://www.campus.varberg.se/dev/scrape/schema.php" rel="self" type="application/rss+xml" />
            <language>se-SE</language>


    <?
    // SINGLE ITEM
    foreach ($matches[0] as $match) {

        // TITLE
        preg_match ("/<font class=\'a_text\'>([^`]*?)<\/TR>/", $match, $temp);
        $title = $temp['1'];
        $title = strip_tags($title);
        $title = str_replace('&nbsp;', '', $title);
        $title = trim($title);

        // DESCRIPTION
        preg_match ("/<font class=\'a_text\'>([^`]*?)<\/TR>/", $match, $temp);
        $text = $temp['1'];
        $text = strip_tags($text);
        $text = trim($text);
        $text = str_replace('&nbsp;', '', $text);

        // GUID
        $token = md5 (uniqid ());

        // RSS XML
        echo "<item>\n";
            echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
            echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
            echo "\t\t\t<content:encoded><![CDATA[ \n";
            echo $text . "\n";
            echo " ]]></content:encoded>\n";
            echo "<guid>http://www.campus.varberg.se/$token.html</guid>\n";
        echo "\t\t</item>\n";
    }
    ?>

    </channel>

    </rss> 
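
    One way the pieces might fit together, as a rough sketch only: build the feed directly from the three captured cells instead of running preg_match again inside the loop. The function name render_feed and the sample $rows data are invented for illustration, the extra namespaces and the atom:link element are left out to keep the sketch short, and whether the result passes the feed validator has not been checked. In the real script the header() call shown above would still have to be sent before any output.

```php
<?php
// Hypothetical helper: wraps scraped rows (each an array of three cells)
// in a minimal RSS 2.0 channel, one <item> per cell.
function render_feed(array $rows): string
{
    $xml  = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
    $xml .= "<rss version=\"2.0\">\n";
    $xml .= "<channel>\n";
    $xml .= "\t<title>Lektioner:</title>\n";
    $xml .= "\t<description>Campus Varberg</description>\n";
    $xml .= "\t<link>http://landris.hh.se</link>\n";

    foreach ($rows as $cells) {
        foreach ($cells as $cell) {
            // Escape the scraped text so it cannot break the XML.
            $text = htmlspecialchars(trim($cell), ENT_QUOTES, 'ISO-8859-1');
            // Extra entropy makes repeated ids within one run less likely to collide.
            $guid = 'http://www.campus.varberg.se/' . md5(uniqid('', true)) . '.html';

            $xml .= "\t<item>\n";
            $xml .= "\t\t<title>$text</title>\n";
            $xml .= "\t\t<description>$text</description>\n";
            $xml .= "\t\t<guid>$guid</guid>\n";
            $xml .= "\t</item>\n";
        }
    }

    $xml .= "</channel>\n</rss>\n";
    return $xml;
}

// Sample row in the shape the scrape produces (values taken from this thread).
$rows = [['08:15-14:00', 'NTB08', 'B113, B114']];
echo render_feed($rows);
```

    Returning the feed as a string rather than echoing piece by piece makes it easy to inspect the result before it is sent to the ticker.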

     

     

  •  11-29-2008, 4:20 PM 48970 in reply to 48964

    Re: PHP scrape and regex problems ... how to exclude?

    I don't know RSS, so you would need to tell me how it should be formatted. For example, show me (without code) how the valid feed text would appear for these items (please be specific about whether the items are grouped or not):

    <item>
        <title>08:15-14:00</title>
        <description>08:15-14:00</description>
        <guid>http://www.campus.varberg.se/88f3d82d23d3bf0a786ddeb11e6f49f2.html</guid>
    </item>

    <item>
        <title>NTB08</title>
        <description>NTB08</description>
        <guid>http://www.campus.varberg.se/b5823149766ccd3d86e6262f8d2cdb85.html</guid>
    </item>

    <item>
        <title>B113, B114</title>
        <description>B113, B114</description>
        <guid>http://www.campus.varberg.se/717e4d57f6364845cabcacb64604084a.html</guid>
    </item>

    <item>
        <title>09:00-12:00</title>
        <description>09:00-12:00</description>
        <guid>http://www.campus.varberg.se/a61472aca226c07180cd638db0006cd8.html</guid>
    </item>

    <item>
        <title>EVM0708</title>
        <description>EVM0708</description>
        <guid>http://www.campus.varberg.se/d303da0192c885225863cbbd856ff127.html</guid>
    </item>

    <item>
        <title>C303</title>
        <description>C303</description>
        <guid>http://www.campus.varberg.se/e84b977e18e7a9ecdf5379e13a46b26e.html</guid>
    </item>


  •  11-29-2008, 6:13 PM 48976 in reply to 48970

    Re: PHP scrape and regex problems ... how to exclude?

    It would look something like this; the item parts are normal XML. For a real and valid feed it would also be preferred that it contained <author> and <url>, but I don't need that:


    Content-Type: text/xml; charset=ISO-8859-1
    <?xml version="1.0" encoding="ISO-8859-1" ?>

    <rss version="2.0"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:admin="http://webns.net/mvcb/"
      xmlns:atom="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

    <channel>
    <title>Lektioner:</title>

    <description>Campus Varberg</