Five Quick Hacks: Downloading Web Pages

Jon Orwant and Dan Gruhl

Modules Used

LWP, Text::Wrap (both on CPAN)

Sure, the web has all kinds of wonderful graphical services. Sit down in front of your computer, go clicky-clicky, and worlds of information are at your fingertips. The problem is, sometimes it's nice not to have to sit in front of a web browser to visit sites. Maybe you'd prefer to have those web pages mailed to you--call it poor man's push technology. Or maybe you'd like to download a lot of information from a huge number of web pages, and you don't want to open them all one by one. Or maybe you'd like to write a robot that scours the web for information. Enter the LWP bundle (sometimes called libwww-perl), which contains two modules that can download web pages for you: LWP::Simple and LWP::UserAgent.

My colleague Dan Gruhl submitted five tiny but exquisite programs to TPJ, all using LWP to automatically download information from a web service. Instead of sprinkling these around the magazine as "TPJ One-Liners", I've collected all five here with a bit of explanation for each.

The first thing to notice is that all five programs look alike. Each uses an LWP module (LWP::Simple in the first three, LWP::UserAgent in the last two) to store the HTML from a web page in Perl's default scalar variable $_. Then they use a series of s/// substitutions to discard the extraneous HTML. The remaining text--the part we're interested in--is displayed on the screen, although it could nearly as easily have been sent as email with the various Mail modules on the CPAN.
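That shared skeleton is easier to see in miniature. Here it is applied to a canned HTML string (a made-up snippet, not fetched from any of the five sites) rather than a live page:

```perl
#!/usr/bin/perl
# The skeleton all five programs share, applied to a canned
# HTML string instead of a freshly downloaded page.

$_ = "<html><body><b>42.5</b> widgets</body></html>";

s/<[^>]+>//g;       # dumbly strip every HTML tag
s/[ \n]+/ /gs;      # collapse runs of spaces and newlines

print $_, "\n";     # 42.5 widgets
```

Swap the canned string for a get() call and tailor the substitutions to the page, and you have any one of the five programs below.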

Downloading currency exchange rates

The currency program converts money from one currency into another, using the exchange rates on www.oanda.com. Here's how to find out what $15.82 is worth in Euros:

$ currency 15.82 USD EUR
--> 15.82 US Dollar = 14.1452 Euro

The LWP::Simple module has a function that makes retrieving web pages easy: get(). When given a URL, get() returns the text of that web page as one long string. In currency, get() is fed a URL for oanda.com containing the three arguments provided to the program: $ARGV[0], $ARGV[1], and $ARGV[2], which correspond to 15.82, USD, and EUR in the sample run above. The resulting web page is stored in $_, after which four s/// substitutions discard unwanted data.

#!/usr/bin/perl -w

# Currency converter.
# Usage: currency [amount] [from curr] [to curr]

use LWP::Simple;

$_ = get("http://www.oanda.com/converter/classic?value=" .
         "$ARGV[0]&exch=$ARGV[1]&expr=$ARGV[2]");
s/^.*<!-- conversion result starts//s;
s/<!-- conversion result ends.*$//s;
s/<[^>]+>//g; s/[ \n]+/ /gs;
print $_, "\n";

The first s/// removes all text before the HTML comment <!-- conversion result starts; the tail of that comment (-->) becomes the arrow that you see in the output. The second s/// removes all text after the conversion result. The third s/// dumbly removes all tags in the text that remains, and the final s/// replaces consecutive spaces and newlines with a single space each.
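Long interpolated URLs like the one in currency are easy to mangle when a line wraps. One way to keep them readable is to assemble the URL with sprintf; the sketch below mirrors the parameter names from the script, with the sample values hardcoded in place of @ARGV:

```perl
#!/usr/bin/perl
# Building the oanda.com query URL with sprintf instead of
# interpolating everything inside one long string literal.

my ($value, $from, $to) = (15.82, "USD", "EUR");

my $url = sprintf
    "http://www.oanda.com/converter/classic?value=%s&exch=%s&expr=%s",
    $value, $from, $to;

print $url, "\n";
```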

Downloading Weather Information

Weather information is downloaded from www.intellicast.com in much the same way as currency information is downloaded from www.oanda.com. The URL is different, the s/// substitutions are different (except for the HTML tag remover), but the basic operation is the same. As an added treat, weather.pl uses the Text::Wrap module to format the output to 76 columns (reformatted for TPJ):

$ weather bos
WHDH WEATHER FORECAST

Wednesday February 17, 1999 at 8:11 AM
---------------------------------------
A stormier pattern is definitely in the making. Two 
significant storms will impact our area during the 
next 5 days. The first one will be bring rain at 
first Wednesday night and especially Thursday, but 
as colder air works down from the higher levels of
the atmosphere, a mix with or change to heavy wet 
snow may occur before that storm is done with us 
later Thursday or Thursday night. However, enormous 
potential rests with the second storm. As of now, 
it would appear the time table could fall in the 
Saturday afternoon into early Sunday time frame. The
rain/snow line is sure to be a challenge with that 
storm, but at this storm it appears that most of 
our area will receive mainly snow, with Cape Cod 
and the Islands falling closest to the rain/snow
line. Please stay tuned to our updates all week long.
Todd and Harv

Todd and Harv say more, but I've truncated their output to save space.

#!/usr/bin/perl

# Prints the weather for a given airport code
#
# Examples: weather bos
#           weather sfo

use LWP::Simple;
use Text::Wrap;

$_ = get("http://www.intellicast.com/weather/$ARGV[0]/" .
         "content.shtml");
s/^.*<BLOCKQUOTE>//s;
s/<\/BLOCKQUOTE>//s;
s/<[^>]+>//g;
s/\n\n\n+/\n\n/g;
s/\©.*//s;

print wrap('', '', $_);   # default: 76 columns
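Text::Wrap wraps to $Text::Wrap::columns, which defaults to 76; assign to that variable before calling wrap() to change the width. A quick demonstration at a narrower setting, using a canned sentence in place of the downloaded forecast:

```perl
#!/usr/bin/perl
# Text::Wrap at a non-default width: set $Text::Wrap::columns
# before calling wrap().

use Text::Wrap;

$Text::Wrap::columns = 40;
my $forecast = "A stormier pattern is definitely in the making, " .
               "and this sentence is long enough to need wrapping.";
my $wrapped = wrap('', '', $forecast);

print $wrapped, "\n";
```

The two empty strings are the indents for the first line and for every subsequent line, which is how you'd produce a hanging indent if you wanted one.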

Downloading News Stories

The CNN home page displays the top news story; cnn formats and displays it using Text::Wrap. I sandwiched Dan's code in a while loop that sleeps for five minutes (300 seconds) and retrieves the top story again. If the new story (as usual, stored in $_) is different from the old story ($old), it's printed.

#!/usr/bin/perl

use LWP::Simple;
use Text::Wrap;

while (sleep 300) {
    $_ = get("http://www.cnn.com");
    s/^.*Top Table//s;
    s/<[^>]+>//g;
    s/FULL STORY.*$//s;
    s/^.*>\s+//s;
    s/\n\n+/\n\n/g;
    if ($old ne $_) { print wrap('', '', $_); $old = $_ }
}

Because sleep returns the number of seconds slept, sleep 300 will always return a true value, and so this while loop will never exit.

Completing U.S. Postal Addresses

There's a TPJ subscriber in Cambridge who hasn't been getting his issues. When each issue goes to press, I FTP my mailing list to a professional mail house that takes care of all the presorting and bagging and labeling that the US Post Office requires--an improvement over the days when I addressed every issue myself in a cloud of Glu-Stik vapors.

The problem is that whether I like it or not, the mail house fixes addresses that seem incorrect. 'Albequerque' becomes 'Albuquerque', and 'Somervile' becomes 'Somerville'. That's great, as long as the rules for correcting addresses--developed by the post office--work. They usually do, but occasionally a correct address is "fixed" to an incorrect address. That's what happened to this subscriber.

The address program pretends to be a user typing information into the fields of the post office's address correction page at http://www.usps.com/ncsc/. That page asks for six fields: company (left blank for residential addresses), urbanization (valid only for Puerto Rico), street, city, state, and zip. You need to provide the street, and either the zip or the city and state. Regardless of which information you provide, the site responds with everything:

$ address company "The Perl Journal" urbanization ""
           street "Boxx 54" city "" state "" zip "02101"

PO BOX 54
BOSTON MA 02101-0054
Carrier Route : B001
County : SUFFOLK
Delivery Point : 54
Check Digit : 8

Note that I deliberately inserted a spelling error: Boxx.

One inconvenience of address is that you have to supply placeholders for all the fields, even the ones you're leaving blank.

This program is a bit trickier than the three you've seen so far. It doesn't use LWP::Simple, but instead two other modules from the LWP bundle: LWP::UserAgent and HTTP::Request::Common. That's because LWP::Simple can handle only HTTP GET queries. This web site uses a POST query, and so Dan used the more sophisticated LWP::UserAgent module, which has an object-oriented interface.

First, an LWP::UserAgent object, $ua, is created with new(), and then its request() method is invoked to POST the address data to the web page. If the POST was successful, the is_success() method returns true, and the page contents can then be found in the _content attribute of the response object, $resp (the response's content() method is the documented way to read it). The address is extracted as the _content is being stored in $_, and two more s/// substitutions remove unneeded data.

#!/usr/bin/perl -w
# Need *either* state *or* zip

use LWP::UserAgent;
use HTTP::Request::Common;

$ua = new LWP::UserAgent;
$resp = $ua->request(
         POST 'http://www.usps.com/cgi-bin/zip4/zip4inq',
         [@ARGV]);

exit -1 unless $resp->is_success;
($_ = $resp->{_content}) =~ s/^.*address is:<p>\n//si;
s/Version .*$//s;
s/<[^>]+>//g;
print;

You can use address to determine the zip code given an address, or to find out your own nine-digit zip code, or even to find out who's on the same mail carrier route as you. If you type in the address of the White House, you'll learn that the First Lady has her own zip code, 20500-0002.
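If all you want from the reply is that nine-digit zip, one regex does it. A hypothetical follow-on, run here against a canned copy of the output shown above rather than a live response:

```perl
#!/usr/bin/perl
# Pulling the ZIP+4 out of an address reply with one regex.
# $reply is a canned copy of the output shown earlier.

my $reply = "PO BOX 54\nBOSTON MA 02101-0054\nCarrier Route : B001\n";

my ($zip9) = $reply =~ /\b(\d{5}-\d{4})\b/;
print "$zip9\n";    # 02101-0054
```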

Downloading Stock Quotes

Salomon Smith Barney's web site is one of many with free 15-minute delayed stock quotes. To find the stock price for Yahoo, you'd provide stock with its ticker symbol, yhoo:

$ stock.pl yhoo
Yahoo Inc
Symbol: YHOO
Last Price: $134 11/16 at 9:39am
Chg.: +1 5/16
Bid: $135 1/16
Ask: $135 1/8

Like address, stock needs the LWP::UserAgent module because it's making a POST query.

Just because LWP::UserAgent has an OO interface doesn't mean that the program has to spend an entire line creating an object and explicitly storing it ($object = new Class), although that was undoubtedly what Gisle Aas envisioned when he wrote the interface. Here, Dan's preoccupation with brevity shows, as he invokes an object's method in the same statement that creates the object: (new LWP::UserAgent)->request(...).
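The idiom works with any class, not just LWP::UserAgent. Here it is on a toy class of our own invention: the object is created and its method invoked in a single expression, with no temporary variable:

```perl
#!/usr/bin/perl
# Creating an object and calling its method in one expression,
# demonstrated on a made-up Greeter class.

package Greeter;
sub new   { bless {}, shift }
sub greet { "hello, " . $_[1] }

package main;
print +(new Greeter)->greet("world"), "\n";   # hello, world
```

The leading + keeps Perl from parsing the parentheses as print's argument list.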

#!/usr/bin/perl

# Pulls a stock quote from Salomon Smith Barney's web site.
#
# Usage:       stock.pl ibm
#
# or whatever stock ticker symbol you like.

use LWP::UserAgent;
use HTTP::Request::Common;

$response = (new LWP::UserAgent)->request(POST
      'http://www.smithbarney.com/cgi-bin/benchopen/quoteget',
              [ search_type => "1",
              search_string => "$ARGV[0]" ]);

exit -1 unless $response->is_success;
$_ = $response->{_content};
s/<[^>]+>//g;
s/^.*recent close[^a-zA-Z0-9]+//s;
@t = split(/\n\n+/);
print shift(@t), "\n";
@h = split(/\n/, shift(@t));
foreach (@h){
    ($f = shift(@t)) =~ s/\n//g;
    next unless $f =~ /\d/; # Skip fields without digits
    print $_, ": ", $f, " ";
    print " at ", shift(@t) if /L/;
    print "\n";
}
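The parsing above hinges on split /\n\n+/, which chops the de-tagged page into chunks wherever one or more blank lines appear. A canned demonstration, with made-up text loosely shaped like the quote page:

```perl
#!/usr/bin/perl
# split /\n\n+/ breaks text into blank-line-separated chunks,
# however many blank lines sit between them.

my $text = "Yahoo Inc\n\nSymbol\n\nYHOO\n\n\nLast Price\n\n134 11/16";
my @t = split /\n\n+/, $text;

print scalar(@t), " chunks\n";   # 5 chunks
print $t[0], "\n";               # Yahoo Inc
```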

Whee

These aren't robust programs. They were dashed off in a couple of minutes for one person's pleasure, and they most certainly will break as the companies in charge of these pages change their web page formats or the URLs needed to access them.

We don't care. When that happens, these scripts will break, we'll notice that, and we'll amend them accordingly. Sure, each of these programs could be made much more flexible. They could be primed to adapt to changes in the HTML, the way a human would if the information were moved around on the web page. Then the s/// expressions would fail, and the programs could expend some effort trying to understand the HTML using a more intelligent parsing scheme, perhaps using the HTML::Parse or Parse::RecDescent modules. If the URL became invalid, the scripts might start at the site home page and pretend to be a naive user looking for his weather or news or stock fix. A smart enough script could start at Yahoo and follow links until it found what it was looking for, but so far no one has been smart enough to write a script like that.

Of course, the time needed to create and test such programs would be much longer than making quick, brittle, and incremental changes to the code already written. No, it's not rocket science--it's not even computer science--but it gets the job done.

__END__


Jon Orwant and Dan Gruhl are members of the MIT Media Laboratory Electronic Publishing Group. When Jon isn't creating TPJ, he's writing Perl programs that write Perl programs that play games. Dan writes programs that hide messages in paper money and search for meaning in large text databases.

