
Downloading Web Pages Through A Proxy Server

Rob Svirskas

Modules Used

LWP, Text::Wrap

In TPJ #13 ("Five Quick Hacks: Downloading Web Pages"), Jon Orwant and Dan Gruhl presented five simple but elegant programs that download information from various web services: stock quotes, weather predictions, currency information, U.S. postal address correction, and CNN headline news. If you're like me, your company uses a firewall to repel wily hackers, which means that you have to use a proxy server to access most URLs. A proxy server (sometimes called a "gateway") is simply an intermediary computer that sends your request to a server and returns its response to you. The bad news: If you try to use the LWP::Simple get() function without first letting it know about your proxy server, it returns undef, with no hint that the firewall is to blame.

The good news: There's a simple way around this. The LWP::Simple module checks an environment variable called http_proxy. If $ENV{http_proxy} contains the URL of a proxy server, your calls to get() are routed through it. You can set environment variables in two ways: either by assigning a value to $ENV{http_proxy} in your program, or by using whatever mechanism your shell or operating system provides. For instance, you can define your proxy server under the Unix bash shell as follows:

% export http_proxy=http://proxy.mycompany.com:1080
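
The other route mentioned above, assigning to $ENV{http_proxy} from inside the program, might look like the sketch below. The BEGIN block is an assumption worth flagging: recent LWP::Simple releases appear to read the proxy variables when the module is loaded, so the assignment has to happen before the use line takes effect. The host name and port are the same placeholders used throughout this article.

```perl
#!/usr/bin/perl -w
use strict;

# Set the proxy before LWP::Simple is loaded; 'use' runs at compile
# time, so the assignment goes in a BEGIN block.
# proxy.mycompany.com:1080 is a placeholder, not a real server.
BEGIN { $ENV{http_proxy} = 'http://proxy.mycompany.com:1080' }

use LWP::Simple qw(get $ua);

# $ua is the user agent LWP::Simple uses internally; asking it for
# the http proxy confirms that the setting was picked up.
print "http requests go through ", $ua->proxy('http'), "\n";

# Fetch a page only if a URL was given on the command line.
if (@ARGV) {
    my $page = get($ARGV[0]);
    print defined $page ? $page : "Couldn't fetch $ARGV[0]\n";
}
```

Run without arguments, the script only reports the proxy it would use; give it a URL to attempt a real download.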

That export command makes LWP::Simple route requests through port 1080 of the proxy server proxy.mycompany.com. You may need to use the set or setenv command instead, depending upon what shell you're using. There are also related environment variables for non-HTTP services: ftp_proxy, gopher_proxy, and wais_proxy. There's also a no_proxy variable, but we'll talk about that in a bit.

Since we are using Perl, There's More Than One Way To Do It. We can still access URLs via a proxy without mucking with environment variables if we replace LWP::Simple with LWP::UserAgent and HTTP::Request::Common. The listing at the end of this article is a version of the currency converter (the first example from TPJ #13) that uses LWP::UserAgent.

The line beginning $ua->proxy defines our proxy server. This routes the user agent's HTTP requests through proxy.mycompany.com. To use a proxy server for multiple protocols, specify them in a list as below:

$ua->proxy(['http','ftp','wais'] =>
            'http://proxy.mycompany.com:1080');
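
If the proxy settings already live in environment variables such as http_proxy, a user agent can also be told to load them itself rather than having each protocol spelled out in the program. The sketch below uses the LWP::UserAgent env_proxy method; the fallback assignment to $ENV{http_proxy} is only there so the example runs on a machine where the variable is unset, and the host name is a placeholder.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

# Placeholder default so the sketch runs even when the shell has not
# set http_proxy; normally the value comes from the environment.
$ENV{http_proxy} ||= 'http://proxy.mycompany.com:1080';

my $ua = LWP::UserAgent->new;

# Load the *_proxy settings (and no_proxy) found in %ENV.
$ua->env_proxy;

print "http proxy: ", $ua->proxy('http'), "\n";
```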

The programs that download the weather report and the CNN top story (the second and third examples from TPJ #13) are equally simple to convert: replace LWP::Simple with LWP::UserAgent and HTTP::Request::Common, and the calls to get() with the user agent code as described above. The U.S. Postal Address program, zip4, already has the UserAgent code, so all we need to do is add a single line of code after the UserAgent has been created:

$ua = new LWP::UserAgent();
$ua->proxy('http','http://proxy.mycompany.com:1080');

Or, if you're into brevity, create the user agent and set its proxy server in one line:

($ua=(new LWP::UserAgent))->proxy('http',
          'http://proxy.mycompany.com:1080');
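
For the weather and CNN programs mentioned earlier, one way to swap out the calls to get() is to wrap the user agent code in a small helper. This is a minimal sketch: proxied_get() is a hypothetical name, not part of LWP, and the proxy host is the article's placeholder. It mirrors LWP::Simple's get() by returning the page content, or undef on failure.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Request::Common;

my $ua = LWP::UserAgent->new;
# Placeholder proxy, as elsewhere in this article.
$ua->proxy('http', 'http://proxy.mycompany.com:1080');

# proxied_get() is a hypothetical helper, not part of LWP. Like
# LWP::Simple's get(), it returns the content, or undef on failure.
sub proxied_get {
    my ($url) = @_;
    my $resp = $ua->request(GET $url);
    return $resp->is_success ? $resp->content : undef;
}

# Fetch a page only if a URL was given on the command line.
if (@ARGV) {
    my $page = proxied_get($ARGV[0]);
    print defined $page ? $page : "Couldn't fetch $ARGV[0]\n";
}
```

With a helper like this in place, each get($url) in the original programs becomes proxied_get($url) and the rest of the code stays the same.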

Most proxy servers will not let you access URLs within your own domain. That's why you often need to use your browser's Preferences menu to identify exceptions, telling your browser which domains to access without using the proxy. Fortunately, we can do that in our programs as well. If you prefer using environment variables:

% export no_proxy="mycompany.com"

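This will bypass the proxy server for URLs whose host name ends in "mycompany.com". In the LWP current as of this writing the test is a plain suffix match, so it also catches hosts like www.itsmycompany.com; later LWP releases may match only at a domain boundary, so check the behavior of your installed version. As you might expect, this can be done in the program instead: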

$ua->no_proxy('mycompany.com');

If your program only needed to access web sites inside your firewall, you wouldn't need to declare the proxy server in the first place, so the no_proxy would be superfluous.
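
Putting the two settings together, a user agent for a program that talks to both inside and outside hosts might be configured as in this sketch. The proxy host is the article's placeholder, and 'localhost' is an illustrative extra entry that does not come from the original programs.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Send http and ftp requests through the (placeholder) proxy...
$ua->proxy(['http', 'ftp'] => 'http://proxy.mycompany.com:1080');

# ...except for hosts in these domains, which are contacted directly.
# no_proxy() accepts a list; 'localhost' is an illustrative addition.
$ua->no_proxy('mycompany.com', 'localhost');

# Calling no_proxy() with no arguments would clear the exception list.

print "proxy for http: ", $ua->proxy('http'), "\n";
```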

__END__

#!/usr/bin/perl -w

# Currency converter.
# Usage: currency [amount] [from curr] [to curr]

use LWP::UserAgent;
use HTTP::Request::Common;

$ua = new LWP::UserAgent();

# Set up your proxy server in the next line.
$ua->proxy('http','http://proxy.mycompany.com:1080');
$resp = $ua->request(GET 
        "http://www.oanda.com/converter/classic?value="
        . "$ARGV[0]&exch=" . uc($ARGV[1]) . 
        "&expr=" . uc($ARGV[2]));

# Use the public accessor rather than reaching into the object's hash.
$_ = $resp->content;
s/^.*<!-- conversion result starts//s;
s/<!-- conversion result ends.*$//s;
s/<[^>]+>//g;
s/[ \n]+/ /gs;
print $_, "\n";

