
A Web Spider in One Line?

Tkil

URLs

libwww-perl (LWP):           http://www.linpro.no/lwp/
Web spiders (robots):         http://info.webcrawler.com/mak/projects/robots/robots.html

Today, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or "spider", in Perl. Which is a grand idea, except that:

  1. Perfectly good spiders have already been written and are freely available at http://info.webcrawler.com/mak/projects/robots/robots.html.
  2. A Perl-based web spider is probably not an ideal project for a novice Perl programmer. Work your way up to it.

Having said that, I immediately pictured a one-line Perl robot. It wouldn't do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I've split it onto separate lines for easier reading.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
    $ua = LWP::UserAgent->new; 
    while (my $link = shift @ARGV) { 
        print STDERR "working on $link"; 
        HTML::LinkExtor->new( 
          sub { 
            my ($t, %a) = @_; 
            my @links = map { url($_, $link)->abs() } 
                       grep { defined } @a{qw/href src/};
            print STDERR "+ $_" foreach @links;
            push @ARGV, @links;
          } ) -> parse( 
           do { 
               my $r = $ua->simple_request 
                 (HTTP::Request->new("GET", $link)); 
               $r->content_type eq "text/html" ? $r->content : ""; 
           } 
         ) 
     }' http://slinky.scrye.com/~tkil/ 

I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn't that much of a terror. Here's the one-line version.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe 
'$ua = LWP::UserAgent->new; while (my $link = shift @ARGV) { 
print STDERR "working on $link";HTML::LinkExtor->new( sub
{ my ($t, %a) = @_; my @links = map { url($_, $link)->abs()
} grep { defined } @a{qw/href src/}; print STDERR "+ $_"
foreach @links; push @ARGV, @links} )->parse(do { my $r =
$ua->simple_request (HTTP::Request->new("GET", $link)); 
$r->content_type eq "text/html" ? $r-> content : ""; } )
}' http://slinky.scrye.com/~tkil/ 

After getting an ego-raising chorus of groans from the hapless onlookers in #perl, I thought I'd try to identify some cute things I did with this code that might actually be instructive to TPJ readers.

Callbacks and Closures

Many modules are designed to do grunt work. In this case, HTML::LinkExtor (a specialized version of HTML::Parser) knows how to look through an HTML document and find links. Once it finds them, however, it needs to know what to do with them.

This is where "callbacks" come in. They're well-known in GUI circles, since interfaces need to know what to do when one presses a button or selects a menu item. Here, HTML::LinkExtor needs to know what to do with links (all tags, actually) when it finds them.

My callback is an anonymous subroutine reference:

      sub { 
          my ($t, %a) = @_; 
          my @links = map { url($_, $link)->abs() } 
                        grep { defined } @a{qw/href src/};
          print STDERR "+ $_" foreach @links;
          push @ARGV, @links;
      } 

I didn't notice until later that $link is actually scoped just outside of this subroutine (in the while loop), which makes the subroutine a closure in Perl's sense: an anonymous sub that refers to a lexical variable declared outside its own body. It's not the classic textbook closure, since it doesn't keep private state alive after the enclosing scope has exited, but it does use a lexical value declared some distance from where the sub itself is defined. (Enough justification for a section title!)
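
To see the callback and the captured lexical in isolation, here is a minimal sketch that runs HTML::LinkExtor over a literal HTML string instead of a fetched page. The helper name links_in, the sample markup, and the example.com base URL are made up for illustration, and it uses URI->new_abs rather than the url() function from URI::URL that the one-liner relies on. HTML::LinkExtor hands the callback the tag name followed by attribute name/value pairs, so the keys of %attr are attribute names such as href and src.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper (not part of the original one-liner): return the
# absolute form of every href/src link found in $html.
sub links_in {
    my ($html, $base) = @_;
    my @found;

    # The anonymous sub captures the lexicals $base and @found from the
    # enclosing scope; that capture is what makes it a closure.
    my $callback = sub {
        my ($tag, %attr) = @_;
        push @found,
            map  { URI->new_abs($_, $base)->as_string }
            grep { defined } @attr{qw/href src/};
    };

    my $parser = HTML::LinkExtor->new($callback);
    $parser->parse($html);
    $parser->eof;    # flush anything the parser is still buffering
    return @found;
}

my @links = links_in(
    '<a href="docs/">Docs</a> <img src="/logo.png">',
    'http://www.example.com/user/',
);
print "$_\n" for @links;
```

Each call to links_in builds a fresh callback bound to its own $base, which mirrors how the one-liner's callback sees a different $link on every pass through the while loop.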

Cascading Arrows

It's amusing to note that, aside from debugging output, the while loop consists of a single statement. The arrow operator (->) only cares about the value of its left-hand side, so a method can be called directly on whatever the previous call returned; this is the heart of the Perl/Tk idiom:

    my $button = $main->Button( ... )->pack();

We use a similar approach, except we don't keep a copy of the created reference (which is stored in $button above):

    HTML::LinkExtor->new(...)->parse(...); 

This is a nice shortcut to use whenever you want to create an object for a single use.
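
The same create-and-use chain works with any class, not just Tk widgets or HTML::LinkExtor. The sketch below uses a toy Greeter class invented purely for this illustration; nothing in it comes from LWP or Perl/Tk.

```perl
use strict;
use warnings;

# Hypothetical toy class, only here to demonstrate chaining.
package Greeter;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name} }, $class;
}

sub greet {
    my ($self) = @_;
    return "hello, $self->{name}";
}

package main;

# Construct the object and call a method on it in one expression.
# No variable ever holds the object, so it is discarded as soon as
# the statement finishes.
my $message = Greeter->new(name => 'world')->greet;
print "$message\n";
```

The chain works because new() returns the object and -> simply invokes greet on that return value; nothing requires the intermediate object to be stored first.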

Using Modules with One-Liners

From my first thought of this one-liner, I knew I'd be using modules from the libwww-perl (LWP) library. The first few iterations of this "one-liner" used LWP::Simple, whose documentation says it should be suitable for one-liners. The -M flag loads a module from the command line (-MLWP::Simple is equivalent to putting use LWP::Simple; at the top of a script), which keeps the code itself short. LWP::Simple fetched the files just fine. I used something like:

    HTML::LinkExtor->new(...)->parse( get $link );

Here get() is a function provided by LWP::Simple; it returns the contents of a given URL (or undef if the fetch fails).

Unfortunately, I needed to check the Content-Type of the returned data. The first version merrily tried to parse .tar.gz files and got confused:

working on ./dist/irchat/irchat-3.03.tar.gz
Use of uninitialized value at
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 104.
Use of uninitialized value at 
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 107.
Use of uninitialized value at 
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 82. 

Ooops.

Switching to the "industrial strength" LWP::UserAgent module allowed me to check the Content-Type of the fetched page. Using this information, together with the HTTP::Response module and a quick ?: construct, I could parse either the HTML content or an empty string.
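
The Content-Type test can be tried without touching the network by building HTTP::Response objects by hand. This is only a sketch: the helper name html_or_empty and the two sample responses are invented for illustration, whereas in the spider the response comes from $ua->simple_request.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: hand back the body only when the response is HTML.
sub html_or_empty {
    my ($response) = @_;

    # In scalar context content_type returns just the media type,
    # without parameters such as charset.
    return $response->content_type eq 'text/html' ? $response->content : '';
}

# Made-up responses standing in for what the user agent would return.
my $page = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
    '<a href="next.html">next</a>',
);
my $tarball = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'application/x-gzip' ],
    'binary data would go here',
);

printf "page:    %d bytes to parse\n", length html_or_empty($page);
printf "tarball: %d bytes to parse\n", length html_or_empty($tarball);
```

Passing an empty string to the parser for anything that is not HTML keeps the loop down to a single statement, at the cost of still downloading the whole file before throwing it away.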

The End

Whenever I write a one-liner, I find it interesting to think about it in different ways. While I was writing it, I was mostly thinking from the bottom up; some of the complex nesting is a result of this. For example, the callback routine is fairly hairy, but once I had it written, I could change the data source from LWP::Simple::get() to LWP::UserAgent and the content() method of HTTP::Response quite easily.

Obviously, this spider does nothing more than visit HTML pages and try to grab all the links off of each one. It could be more polite (but see the LWP::RobotUA module for some of that) and it could be smarter about which links to visit. In particular, there's no sense of which pages have already been visited; a tied DBM of visited pages would solve that nicely.
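
As a rough sketch of the visited-pages idea, the following ties a hash to an SDBM file (SDBM_File is one of the DBM modules bundled with Perl) and filters new links through it. The helper name unvisited, the temporary directory, and the example.com URLs are assumptions made for the illustration; a real spider would keep the database somewhere permanent.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Throwaway location for the demo database.
my $dir = tempdir(CLEANUP => 1);

tie my %seen, 'SDBM_File', "$dir/visited", O_RDWR | O_CREAT, 0666
    or die "cannot tie visited-page database: $!";

# Hypothetical helper: return only the URLs not seen before, and
# record each of them in the hash as a side effect.
sub unvisited {
    my ($seen, @urls) = @_;
    return grep { !$seen->{$_}++ } @urls;
}

my @queue = unvisited(\%seen,
    'http://www.example.com/',
    'http://www.example.com/a.html',
    'http://www.example.com/',          # duplicate, dropped
);
push @queue, unvisited(\%seen,
    'http://www.example.com/a.html',    # already seen, dropped
    'http://www.example.com/b.html',
);

print "$_\n" for @queue;
untie %seen;
```

In the one-liner this would amount to pushing unvisited(\%seen, @links) onto @ARGV instead of @links itself.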

Even with these limitations, I'm impressed at the power expressed by that "one" line. Kudos for that go to Gisle Aas (the author of LWP) and to Larry Wall, for making a language that does all the boring stuff for us. Thanks Gisle and Larry!

__END__


Tkil can be found at tkil@scrye.com. He lives in Fort Collins, Colorado, with a small pod of computers, a wall full of CDs, some neglected juggling toys, a closetful of neuroses, bunches of books, and a string of Christmas lights for illumination. He enjoys playing with Perl, C++, and Unix, and sometimes even manages to get paid for it. The rest of his time is wasted on IRC EFNet's #perl channel.
