Technical summary: extracting text from fetched HTML pages and restating it as more semantically structured XML.
In the September 2002 issue of The Perl Journal, Derek Valada's article Parsing RSS Files with XML::RSS sang the praises of RSS feeds, and showed how even if you don't have any RSS client programs or don't use a web site that aggregates them for you, it takes just a bit of Perl to write your own little utility for viewing the RSS content in your web browser. I can testify that once you get used to having RSS feeds from a few sites, you want all your favorite sites to have them. This article is about what to do when a site doesn't have an RSS feed: make one for yourself, by writing a little Perl tool to get content from the site's HTML.
Some wonderful web sites provide RSS feeds that make it easy for us to find out when there's something interesting at their site, without actually having to go to that web site first. But some web sites just haven't gotten around to providing an RSS feed. Sometimes this is just because the site's programmers (if there are any!) just happen not to have heard about RSS yet. Or sometimes it's just that the people maintaining the site don't know that so many people actually use RSS and would appreciate an RSS feed -- the main sysadmin of one of the Net's larger web-logging sites recently told me "I've considered it a bit but haven't found any real compelling reason yet, and no-one else has seemed very interested in it".
Hopefully all the larger sites will come around to providing RSS feeds; but I bet there will always be routinely updated content on the Web that lacks RSS feeds. In that case, we have to make our own.
The first step in making an RSS feed for a remote site is checking that they really don't already have one! Unlike a /robots.txt file, the RSS feed for a site doesn't have a predictable name, nor is there even any one place on a site where it's customary to mention the URL of the RSS feed. Some sites mention the URL in their FAQ, and recently some sites have started mentioning the URL in an HTML link element in the site's HTML, like so:
<link rel="alternate" type="application/rss+xml" href="http://that_rss_url" >
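Checking a page for such a link element can itself be scripted. What follows is only a minimal sketch, not a real HTML parser: it assumes the page's HTML is already in a string (say, from LWP::Simple's get), and the routine name find_rss_links is made up for illustration. It looks at each link tag, keeps those whose type attribute is application/rss+xml, and returns their href values.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href of every <link> element in $html
# that advertises an RSS feed via type="application/rss+xml".
sub find_rss_links {
  my $html = shift;
  my @feeds;
  # Look at each <link ...> tag in turn
  while ($html =~ m{<link\b([^>]*)>}gi) {
    my $attrs = $1;
    next unless $attrs =~ m{\btype\s*=\s*["']application/rss\+xml["']}i;
    push @feeds, $1 if $attrs =~ m{\bhref\s*=\s*["']([^"']+)["']}i;
  }
  return @feeds;
}

my $sample = q{<head><link rel="stylesheet" href="/s.css">
<link rel="alternate" type="application/rss+xml" href="http://that_rss_url" >
</head>};
print "$_\n" for find_rss_links($sample);
```

Run against the sample, this prints the one advertised feed URL and skips the stylesheet link. Pages that write their attributes without quotes would slip past this sketch.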
If you can't find an RSS feed that way, search Google for "sitename RSS" or "sitename RDF" -- you'd be surprised how effective that is. And if that doesn't get you anywhere, email the site's webmaster and ask for the URL of their RSS. If they get enough such messages, they'll make a point of more clearly stating the RSS feed's URL if they have one, or setting one up if they don't.
But if absolutely none of these things work out, then it's time to roll up your sleeves and write an RSS generator that extracts content from the site's HTML.
Processing HTML to find the bits of content that you want is one of the black arts of modern programming. Not only is each web page different, but as its content varies from day to day, it may exhibit unexpected changes in its template, which your program is hopefully robust enough to deal with. In fact, most of my book Perl & LWP is an explanation of the bag of tricks that you should learn to use for writing really robust HTML-scanner programs. You also need a bit of practice, but if you don't have any experience at scanning HTML, then doing so for writing to RSS is a great place to start. In this article, I'll stick to just using regular expressions instead of more advanced approaches involving HTML::TokeParser or HTML::TreeBuilder.
The basic approach is this:
    use LWP::Simple;
    my $content_url = 'http://whatever.url.to/get/from.html';
    my $content = get($content_url);
    die "Can't get $content_url" unless defined $content;
    ...then extract things from $content...
So, for example, consider freshair.npr.org, the web site for National Public Radio's interview program Fresh Air. One page on the site has the listings for the current program, with HTML like this:
    <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.01.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> John Lasseter </B>
    </FONT></A>
    ...
    <A HREF="http://www.npr.org/ramfiles/fa/20020920.fa.02.ram">Listen to
    <FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3">
    <B> Singer and guitarist Jon Langford </B>
    </FONT></A>
    ... plus any other segments ...
The parts that we want to extract are this:
    http://www.npr.org/ramfiles/fa/20020920.fa.01.ram
    John Lasseter

    http://www.npr.org/ramfiles/fa/20020920.fa.02.ram
    Singer and guitarist Jon Langford
We can get the page and match the content with this bit of code, whose regular expression we arrive at through a bit of trial-and-error:
    use LWP::Simple;
    my $content_url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
    my $content = get($content_url);
    die "Can't get $content_url" unless defined $content;
    $content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines

    my @items;
    while($content =~ m{ \s+<A HREF="([^"\s]+)">Listen to \s+<FONT FACE="Verdana, Charcoal, Sans Serif" COLOR="#ffffff" SIZE="3"> \s+<B>(.*?)</B> }g) {
      my($url, $title) = ($1,$2);
      print "url: {$url}\ntitle: {$title}\n\n";
      push @items, $title, $url;
    }
When run, this happily produces this output, showing that it's properly matching the three segments in that page (at time of writing):
    url: {http://www.npr.org/ramfiles/fa/20020920.fa.01.ram}
    title: {John Lasseter}

    url: {http://www.npr.org/ramfiles/fa/20020920.fa.02.ram}
    title: {Singer and guitarist Jon Langford}

    url: {http://www.npr.org/ramfiles/fa/20020920.fa.03.ram}
    title: {Film critic David Edelstein}
Later we can comment out that print statement and add some code to write @items to an RSS file.
Now consider this similar case where we're scanning the HTML in the Guardian's web page for breaking news:
    ...
    <A HREF="/worldlatest/story/0,1280,-2035841,00.html">Unsolved Crimes Vex Afghanistan</A><BR><B>6:50 am</B><P>
    <A HREF="/worldlatest/story/0,1280,-2035838,00.html">Christians Show Support For Israel</A><BR><B>6:40 am</B><P>
    <A HREF="/worldlatest/story/0,1280,-2035794,00.html">Schroeder's Party Wins 2nd Term</A><BR><B>5:30 am</B><P>
    ...
It's a great big bunch of unbroken HTML (which I've put newlines into just for readability), but look at it a bit and you'll see that each item in it is like this:
<A HREF="url">headline</A><BR><B>time</B><P>
You'll also note that items follow each other, one after another, with no intervening "</p><p>" or newlines or anything.
So we cook up that pattern we observe, into a regular expression and put it into our code template from above:
    use LWP::Simple;
    my $content_url = 'http://www.guardian.co.uk/worldlatest/';
    my $content = get($content_url);
    die "Can't get $content_url" unless defined $content;
    $content =~ s/(\cm\cj|\cj|\cm)/\n/g; # nativize newlines

    my @items;
    while($content =~
      m{<A HREF="(/worldlatest/.*?)">(.*?)</A><BR><B>.*?</B><P>}g
    ) {
      my($url, $title) = ($1,$2);
      print "url: {$url}\ntitle: {$title}\n\n";
      push @items, $title, $url;
    }
When we run that, that code correctly produces this list of items:
    url: {/worldlatest/story/0,1280,-2035841,00.html}
    title: {Unsolved Crimes Vex Afghanistan}

    url: {/worldlatest/story/0,1280,-2035838,00.html}
    title: {Christians Show Support For Israel}

    url: {/worldlatest/story/0,1280,-2035794,00.html}
    title: {Schroeder's Party Wins 2nd Term}

    ...and a dozen more items...
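Because the remote page's template can change without notice, it can help to try the regular expression against a saved snippet before pointing it at the live site, and to complain loudly when nothing matches. Here is a sketch along those lines, using two of the items quoted above as test data. The routine name scan_guardian is only illustrative, and the regexp is the same one as in the program above.

```perl
use strict;
use warnings;

# Illustrative wrapper around the same regexp used above:
# returns a flat list of (title, url, title, url, ...)
sub scan_guardian {
  my $content = shift;
  my @items;
  while ($content =~
    m{<A HREF="(/worldlatest/.*?)">(.*?)</A><BR><B>.*?</B><P>}g
  ) {
    push @items, $2, $1;
  }
  return @items;
}

# Saved sample of the page, as one unbroken run of HTML
my $sample = join '',
  '<A HREF="/worldlatest/story/0,1280,-2035841,00.html">',
  'Unsolved Crimes Vex Afghanistan</A><BR><B>6:50 am</B><P>',
  '<A HREF="/worldlatest/story/0,1280,-2035838,00.html">',
  'Christians Show Support For Israel</A><BR><B>6:40 am</B><P>';

my @items = scan_guardian($sample);
die "No items matched -- has the page's template changed?" unless @items;
printf "%d items; first is {%s}\n", @items / 2, $items[0];
```

The die line is the useful part in day-to-day running: an empty @items usually means the site has been redesigned, and it is better to hear about that than to publish an empty feed.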
We're ready to make both of these programs write their @items to an RSS feed -- except for one thing: URLs in an RSS feed should really be absolute (starting with "http://..."), and not relative URLs like the "/worldlatest/story/0,1280,-2035794,00.html" we got from the Guardian page. Luckily the URI.pm class provides a simple way to turn a relative URL to an absolute one, given a base URL:
    URI->new_abs($rel_url => $base_url)->as_string
We can use this by just adding a "use URI;" to the start of our program, and changing the end of our while loop to read like so:
      $url = URI->new_abs($url => $content_url)->as_string;
      print "url: {$url}\ntitle: {$title}\n\n";
      push @items, $title, $url;
    }
With that change made, our program emits absolute URLs, like this:
    url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html}
    title: {Unsolved Crimes Vex Afghanistan}

    url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035838,00.html}
    title: {Christians Show Support For Israel}

    url: {http://www.guardian.co.uk/worldlatest/story/0,1280,-2035794,00.html}
    title: {Schroeder's Party Wins 2nd Term}

    ...and a dozen more items...
An RSS file is a kind of XML file that expresses some data about the site in general, and then lists the details of each story item at that feed. While RSS has actually many more features than I'll discuss here (especially in later versions than the 0.91 version here), a minimal RSS file starts with an XML header, an appropriate doctype, and some metadata elements, like this:
    <?xml version="1.0"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91"><channel>
      <title> title of the site </title>
      <description> description of the site </description>
      <link> URL of the site </link>
      <language> the RFC 3066 language tag for this feed's content </language>
Then there's a number of item elements like this:
<item><title>...headline...</title><link>...url..</link></item>
And then the document ends like this:
</channel></rss>
So the RSS file that we would produce with our Fresh Air HTML scanner would look like this, shown here with a bit of helpful indenting:
    <?xml version="1.0"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91"><channel>
      <title>Fresh Air</title>
      <description>Terry Gross's interview show on NPR</description>
      <link>http://freshair.npr.org/dayFA.cfm?todayDate=current</link>
      <language>en-us</language>
      <item>
        <title>John Lasseter</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.01.ram</link>
      </item>
      <item>
        <title>Singer and guitarist Jon Langford</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.02.ram</link>
      </item>
      <item>
        <title>Film critic David Edelstein</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.03.ram</link>
      </item>
    </channel></rss>
We can break this down to three main pieces of code: one for all the stuff before the first item, one for taking our @items and spitting out the series of <item>...</item> elements, and then one to complete the XML by outputting </channel></rss>.
But first there's one consideration -- we can't really just take the code we pulled out of the HTML and dump it into XML. The reason for this is that XML is much less forgiving than HTML is, notably with &foo; codes, or "character entity references", as they're called. That is, the HTML could have this:
<a href="...">Proctor & Gamble to merge with H&R Block</a>
That's acceptable (if not proper) HTML -- but it's strictly forbidden in XML, and will cause any XML parser to reject that document. In XML, if there's a &, it must be the start of a character entity reference, and if you mean a literal "&", it must be expressed with just such a &foo; code -- typically as &amp;, but also possibly as &#38; or &#x26;.
Moreover, just because something is a legal HTML &foo; code doesn't mean it's legal in an RSS file. For sake of compatibility, the RSS 0.91 DTD (at the URL you see in the <!DOCTYPE...> declaration) defined the same &foo; codes as HTML -- but that was HTML of several years ago, back when there were just codes for the Latin-1 characters 160 to 255. That gets you codes like &eacute;, but if you try using more recent additions like &euro; or &mdash;, the RSS parser will fail to parse that document. So just to be on the safe side, we should decode all the &foo; codes in the HTML, and then re-encode everything, except using numeric codes (like &#123;), since those are always acceptable to XML parsers. And while we're at it, we should kill any tags inside that HTML, in case the HTML that we captured happens to contain some <br>'s, which would make this malformed as XML.
To do the &foo; decoding, we can use the ever-useful HTML::Entities module (available in CPAN as part of the HTML-Parser distribution). Then we do a little cleanup and just use a simple regexp to replace each unsafe character (like & or é) with a &#number; code. The eminently re-useable routine for doing that looks like this:
    use HTML::Entities qw(decode_entities);

    sub xml_string {
      # Take an HTML string and return it as an XML text string

      local $_ = $_[0];

      # Collapse and trim whitespace
      s/\s+/ /g;
      s/^ //s;
      s/ $//s;

      # Delete any stray HTML tags
      s/<.*?>//g;

      decode_entities($_);

      # Substitute or strike out forbidden MSWin characters!
      tr/\x91-\x97/''""*\x2D/;
      tr/\x7F-\x9F/?/;

      # &-escape every potentially unsafe character
      s/([^ !#\$%\x28-\x3B=\x3F-\x7E])/'&#'.(ord($1)).';'/seg;

      return $_;
    }
Once we've got the xml_string routine defined as above, we can then use that in a routine that takes the contents of our @items (alternating title and URL), and returns XML of it as a series of <item>...</item> elements, like so:
    sub rss_body {
      my $out = '';
      while(@_) {
        $out .= sprintf
          " <item>\n\t<title>%s</title>\n\t<link>%s</link>\n </item>\n",
          map xml_string($_),
            splice(@_,0,2);  # get the first two each time
      }
      return $out;
    }
We can test that routine by doing this:
    print rss_body("Bogodyne rockets &gt; 250&frac12;/share!", "http://test");
Its output is this:
     <item>
        <title>Bogodyne rockets &#62; 250&#189;/share!</title>
        <link>http://test</link>
     </item>
This is correct, since &#62; is XMLese for "a literal > character", and &#189; is for "a literal ½ character". Since this is all working happily, we can make another routine for the start of the XML document:
    sub rss_start {
      return sprintf q[<?xml version="1.0"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91"><channel>
      <title>%s</title>
      <description>%s</description>
      <link>%s</link>
      <language>%s</language>
    ],
        map xml_string($_),
          @_[0,1,2,3];  # Call with: title, desc, URL, language!
    }
...and for the end:
    sub rss_end {
      return '</channel></rss>';
    }
Then spitting out the bare-bones RSS XML that we're after is just a matter of calling:
    print
      rss_start(
        "Guardian World Latest",
        "Latest Headlines from the Guardian",
        $content_url,
        'en-GB',  # language tag for UK English
      ),
      rss_body(@items),
      rss_end()
    ;
Run the program, and it indeed spits out this valid XML:
    <?xml version="1.0"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91"><channel>
      <title>Guardian World Latest</title>
      <description>Latest Headlines from the Guardian</description>
      <link>http://www.guardian.co.uk/worldlatest/</link>
      <language>en-GB</language>
     <item>
        <title>Unsolved Crimes Vex Afghanistan</title>
        <link>http://www.guardian.co.uk/worldlatest/story/0,1280,-2035841,00.html</link>
     </item>
     <item>
        <title>Christians Show Support For Israel</title>
        <link>http://www.guardian.co.uk/worldlatest/story/0,1280,-2035838,00.html</link>
     </item>
     <item>
        <title>Schroeder's Party Wins 2nd Term</title>
        <link>http://www.guardian.co.uk/worldlatest/story/0,1280,-2035794,00.html</link>
     </item>
     ...
     <item>
        <title>Yemen Holds 104 Terror Suspects</title>
        <link>http://www.guardian.co.uk/worldlatest/story/0,1280,-2035650,00.html</link>
     </item>
    </channel></rss>
To do the same for our Fresh Air program, we just append the same code, just changing the parameters to rss_start, like so:
    print
      rss_start(
        "Fresh Air",
        "Terry Gross's interview show on National Public Radio",
        $content_url,
        'en-US',  # language tag for US English
      ),
      rss_body(@items),
      rss_end()
    ;
Running that program returns this RSS expression of our @items:
    <?xml version="1.0"?>
    <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
      "http://my.netscape.com/publish/formats/rss-0.91.dtd">
    <rss version="0.91"><channel>
      <title>Fresh Air</title>
      <description>Terry Gross's interview show on National Public Radio</description>
      <link>http://freshair.npr.org/dayFA.cfm?todayDate=current</link>
      <language>en-US</language>
     <item>
        <title>John Lasseter</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.01.ram</link>
     </item>
     <item>
        <title>Singer and guitarist Jon Langford</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.02.ram</link>
     </item>
     <item>
        <title>Film critic David Edelstein</title>
        <link>http://www.npr.org/ramfiles/fa/20020920.fa.03.ram</link>
     </item>
    </channel></rss>
And because we put everything through our xml_string routine, the XML text is always properly escaped, even if our original HTML scanner regexp happens to trap an HTML tag or malformed &foo; code.
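A cheap way to double-check that before publishing a feed is to scan the generated text for any & that is not the start of a numeric character reference. This is only a rough sanity test, resting on the assumption that all the text went through xml_string (which emits nothing but decimal &#number; codes). It is no substitute for running the feed through a real XML parser, and the routine name bare_ampersands is invented for this sketch.

```perl
use strict;
use warnings;

# Count "&" characters that do not begin a decimal reference like &#38;
sub bare_ampersands {
  my $xml = shift;
  my $count = () = $xml =~ m/&(?!#[0-9]+;)/g;
  return $count;
}

my $good = '<title>Proctor &#38; Gamble</title>';
my $bad  = '<title>Proctor & Gamble to merge with H&R Block</title>';
printf "good: %d, bad: %d\n", bare_ampersands($good), bare_ampersands($bad);
```

Note that this check is deliberately stricter than XML itself: it would also flag a perfectly legal &amp;, since the routines in this article never produce one.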
That's all there is to making a basic RSS generator program. The only question left is how to have it run.
There are two main ways for an RSS generator program to get used -- either it should run as a CGI and send output to the browser on demand, or it should be run periodically via cron, and save output to a file that can be accessed at some URL.
The mechanics are simple. If the program is to run as a CGI, just start its output out with a MIME header like so:
    print "Content-type: application/rss+xml\n\n",
      rss_start(
        ... and the rest, as before ...
If you want to save the output to a file, instead do this:
    my $outfile = '/home/jschmo/public_html/freshair.rss';
    open(OUTXML, ">$outfile")
      || die "Can't write-open $outfile: $!\nAborting";
    print OUTXML
      rss_start(
        ... and the rest, as before ...
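One refinement worth considering when the file is regenerated by a scheduled job: write to a temporary name and rename it into place, so that a client fetching the feed mid-update never sees a half-written file. Here is a sketch, assuming a Unix-like system and a temporary file in the same directory as the target. The routine name write_atomically and the .tmp suffix are arbitrary choices for illustration, and the example writes a stub feed into a scratch directory rather than a real public_html.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write $text to "$outfile.tmp", then rename it over $outfile.
# On Unix-like systems, rename() within one filesystem swaps the
# new file into place in a single step.
sub write_atomically {
  my ($outfile, $text) = @_;
  my $tmp = "$outfile.tmp";
  open(my $fh, '>', $tmp) or die "Can't write-open $tmp: $!\nAborting";
  print {$fh} $text;
  close($fh) or die "Can't close $tmp: $!\nAborting";
  rename($tmp, $outfile) or die "Can't rename $tmp to $outfile: $!\nAborting";
  return;
}

# Example call, using a scratch directory that is removed at exit
my $dir     = tempdir(CLEANUP => 1);
my $outfile = "$dir/freshair.rss";
my $feed    = qq{<rss version="0.91"><channel></channel></rss>\n};
write_atomically($outfile, $feed);
print "wrote ", (-s $outfile), " bytes to $outfile\n";
```

In the real program, $feed would be the concatenation of rss_start(...), rss_body(@items), and rss_end().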
The more complex issue is: under what conditions would you want to do it one way or the other? If the program runs as a CGI, it will connect to the remote server to get the HTML as many times as there are requests for the RSS feed. If this is an RSS feed that only you know about, and you'll access it only a few times a day at most, then having it run as a CGI is just fine.
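If a CGI version might occasionally see more traffic than that, one hedge is to keep a copy of the last output on disk and reuse it while it is still fresh. The following is a sketch of that idea and not part of the programs above: cached_feed is a hypothetical routine, the 900-second limit is an arbitrary choice, and $build stands in for whatever code fetches the HTML and assembles the RSS.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Return the cached feed text if $cache_file is younger than $max_age
# seconds; otherwise call $generate (a code ref that builds the feed),
# save its result to $cache_file, and return that.
sub cached_feed {
  my ($cache_file, $max_age, $generate) = @_;
  my $mtime = (stat $cache_file)[9];   # undef if the file doesn't exist
  if (defined $mtime and time() - $mtime < $max_age) {
    open(my $in, '<', $cache_file) or die "Can't read-open $cache_file: $!";
    local $/;                          # slurp the whole file
    my $text = <$in>;
    close($in);
    return $text;
  }
  my $text = $generate->();
  open(my $out, '>', $cache_file) or die "Can't write-open $cache_file: $!";
  print {$out} $text;
  close($out) or die "Can't close $cache_file: $!";
  return $text;
}

# Demonstration with a stand-in generator that counts its own calls
my $dir   = tempdir(CLEANUP => 1);
my $calls = 0;
my $build = sub { $calls++; return "<rss/>\n" };
cached_feed("$dir/feed.rss", 900, $build) for 1 .. 3;
print "generator ran $calls time(s)\n";
```

Three requests in a row trigger only one run of the generator. In a CGI, the returned text would be printed after the Content-type header.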
But if the RSS feed could be accessed quite often, then it would be more efficient for your server as well as for the remote server, if you have the RSS updater run periodically via cron, as with a crontab line like this:
13 6-17 * * 1-5 /home/jschmo/make_fresh_air_rss
That will run the program at 13 minutes past the hour between 6:13am and 5:13pm, Monday through Friday -- and those are the only times that we'll request the HTML from the server, no matter how many times the resulting RSS file gets hit. Implicit in those crontab settings is the assumption that we don't really need absolutely up-to-the-minute information (or else we'd set it to run more often, or just go back to using the CGI approach) and that there's no point in accessing the RSS data outside of those hours. Since Fresh Air is produced only once every weekday, during the day, I've judged that it's very unlikely that their HTML listings page will change outside of those hours.
You should always be considerate of the remote web server, so you should request its HTML only as often as necessary. Not only does this approach go easy on the remote server, it also goes easy on your server that's running the RSS generator. This is only fitting, considering that the whole point of an RSS file is to bring people to the content they're interested in, as efficiently as possible, from the points of view of the people and of the web servers involved.
__END__
Sean M. Burke (<sburke@cpan.org>) is a long-time contributor to CPAN, and is the author of Perl & LWP from O'Reilly & Associates.