
Webpluck

Ed Hill

The promises of smart little web agents that run around the web and grab things of interest to you have gone unfulfilled. Like me, you probably have a handful of web pages that you check on a regular basis, and if you had the time, you'd check many more.

Listed below are a few of the bookmarks that I check on a regular basis. Each of these pages has content that changes every day, and it is, of course, the content of these pages that I am interested in - not their layout, nor the advertising that appears on the pages.

Dilbert (of course)                   CNN US News
Astronomy Picture of the Day          C|Net's NEWS.COM
The local paper (The Daily Iowan)     ESPNET Sportszone

These pages are great sources of information. My problem is that I don't have time to check each one every day to see what is there or if the page has been updated. What I want is my own personal newspaper that is built from the sources listed above.

Similar Tools

This is not an original idea, but after spending many hours searching for a tool to do what I wanted, I gave up. Here are the contenders I considered, and why they didn't do quite what I wanted.

First there is the "smart" agent. This little gremlin roams the net trying to guess what you want to see using either some AI technique or other algorithm. Firefly (http://www.firefly.com/) is an example; you tell it you are interested in a particular topic and it points you at a list of sites that others have ranked. When I first looked at Firefly, it suggested that since I was interested in "computers and the internet," I should check out "The Amazing Clickable Beavis" (http://web.nmsu.edu/~jlillibr/ClickableBeavis/).

This is why I don't have much confidence in agents. Besides, I know what I want to see. I have the URLs in hand. I just don't have the time to go and check all the pages every day.

The second type of technology is the "custom newspaper." There are two basic types. CRAYON ("Create Your Own Newspaper," headquartered at http://www.crayon.net/) is one flavor of personalized newspaper. CRAYON is little more than a page full of links to other pages that change every day. For me, CRAYON just adds to the problem, listing tons of pages that I wish I had time to check out. I was still stuck clicking through lists of links to visit all the different pages.

Then there are sites like My Yahoo (http://my.yahoo.com/), a single page whose content changes every day. This is very close to what I needed - a single site with all of the information I need. My Yahoo combines resources from a variety of different sources. It shows a one-line summary of an article; if it's something that I find interesting, I can click on the link to read more about it. The only problem with My Yahoo is that it's restricted to a small set of content providers. I want resources other than what Yahoo provides.

Since these tools didn't do exactly what I wanted, I decided to write my own. I figured with Perl, the LWP library, and a weekend, I could throw together exactly what I wanted. Thus webpluck was born. My goal was to write a generic tool that would automatically grab data from any web page and create a personalized newspaper exactly like My Yahoo. I decided the best approach was to define a regular expression for each web page of interest. webpluck uses the LWP library to retrieve the web page, extracts the content with a regular expression tailored to each, and saves it to a local cache for display later. Once it has done this for all the sources, I use a template to generate my personal newspaper.

How To Use Webpluck

I don't want this article to turn into a manual page (since one is provided), but here is a brief summary of how to use webpluck. You first create a configuration file containing a list of targets that define which pages you want to read, and the regular expression to match against the contents of that page. Here is an example of a target definition that retrieves headlines from the CNN US web page.

name     cnn-us 
url      http://www.cnn.com/US/ 
regex    <h2>([^\<]+)<\/h2>.*?<a href=\"([^\"]+)\" 
fields   title:url 

These four lines define the following: the name of the file that will hold the data retrieved from the web page; the URL of the page (if you point at a page containing frames, you need to determine the URL of the frame that actually contains the content); the Perl regular expression used to extract data from the page; and the names of the fields matched by that regular expression. The first pair of parentheses in the regex field corresponds to the first field, the second pair to the second, and so on. For the configuration shown, ([^\<]+) is tagged as the title and ([^\"]+) is tagged as the url. That url is the link to the actual content, distinct from the url definition on the second line, which is the page the regular expression is matched against.
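
If you're curious how a target definition like this might be read in, here's a short sketch in Perl. It is not webpluck's actual parser - the read_targets name and the webpluck.conf filename are mine - but it shows how little is needed for the whitespace-separated key/value format above:

# A sketch of reading target definitions like the one above into a
# list of hashes.  Not webpluck's real parser; it assumes each "name"
# line starts a new target and that keys and values are separated by
# whitespace.
use strict;

sub read_targets {
    my ($file) = @_;
    my (@targets, $current);
    open(CONF, $file) or die "can't open $file: $!";
    while (<CONF>) {
        chomp;
        next if /^\s*$/ || /^\s*#/;          # skip blank lines and comments
        my ($key, $value) = split(' ', $_, 2);
        $value =~ s/\s+$// if defined $value; # trim trailing whitespace
        if ($key eq 'name') {                 # a new target begins
            $current = { name => $value };
            push(@targets, $current);
        } else {
            $current->{$key} = $value;
        }
    }
    close(CONF);
    return @targets;
}

my @targets = read_targets('webpluck.conf');
print "$_->{'name'}: $_->{'url'}\n" for @targets;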

Running webpluck with the target definition above creates a file called cnn-us in a cache directory that you define. Here's the file from March 25th, 1997:

title:Oklahoma bombing judge to let 'impact witnesses' see trial 
url:http://www.cnn.com/US/9703/25/okc/index.html

title:Simpson's attorneys ask for a new trial and lower damages 
url:http://www.cnn.com/US/9703/25/simpson.newtrial/index.html 

title:U.S. playing low-key role in latest Mideast crisis 
url:http://www.cnn.com/WORLD/9703/25/us.israel/index.html 

title:George Bush parachutes -- just for fun 
url:http://www.cnn.com/US/9703/25/bush.jump.ap/index.html 

An OJ story. Who would have guessed?
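
The format is simple on purpose: each line is a field:value pair, and a blank line separates one record from the next. Reading such a file back takes only a few lines of Perl. Here is one way it might be done (a sketch, not webpluck's own code):

# Read a cache file (field:value lines, blank line between records)
# back into a list of hashes.  A sketch only.
sub read_cache {
    my ($file) = @_;
    my (@records, %record);
    open(CACHE, $file) or die "can't open $file: $!";
    while (<CACHE>) {
        chomp;
        if (/^\s*$/) {                        # a blank line ends a record
            push(@records, { %record }) if %record;
            %record = ();
        } elsif (/^([^:]+):(.*)$/) {          # split on the first colon only
            $record{$1} = $2;
        }
    }
    push(@records, { %record }) if %record;   # don't forget the last record
    close(CACHE);
    return @records;
}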

As you might expect, everything depends on the regular expression, which must be tailored for each source. Not everyone (myself included) feels entirely comfortable with regular expressions; if you want to get the most out of webpluck and your regular expression skills are soft, I suggest picking up Jeffrey Friedl's book Mastering Regular Expressions.

The second problem with regular expressions is that, as powerful as they are, they can only match data they expect to see. If the publisher of the web page you are after changes his or her format, you'll have to update your regular expression. webpluck notifies you when it can't match anything, which is usually a good indication that the format of the target web page has changed.
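
One habit that helps is to test a candidate regular expression against the live page before it goes into your configuration file. A throwaway script like the following (using LWP::Simple, with the CNN example from above pasted in) shows immediately whether the expression matches anything:

#!/usr/bin/perl -w
# Quick check of a candidate regex before adding it to the config file.
# The URL and regex are just the CNN example from earlier.
use strict;
use LWP::Simple;

my $url     = 'http://www.cnn.com/US/';
my $regex   = '<h2>([^\<]+)<\/h2>.*?<a href=\"([^\"]+)\"';

my $content = get($url) or die "couldn't fetch $url\n";

my $hits = 0;
while ($content =~ /$regex/isg) {
    print "title: $1\nurl:   $2\n\n";      # two fields in this example
    $hits++;
}
print "regex matched $hits time(s)\n";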

Once all the content has been collected, webpluck takes those raw data files and a template file that you provide, and combines them to create your "dynamic" HTML document.

webpluck recognizes a new "tag" in your template file called <clip>. It replaces these tags with data from the raw data files that webpluck created earlier. Everything else in the template file is passed through as is. Here is an example of a segment in my daily template file (again using the CNN US headlines as an example).

<clip name="cnn-us"> 
<li><a href="url">title</a> 
</clip> 

This is replaced with the following HTML (the lines have been split to make them more readable):

<li><a href="http://www.cnn.com/US/9703/25/okc/index.html">\ 
Oklahoma bombing judge to let 'impact witnesses' see trial</a> 

<li><a href="http://www.cnn.com/US/9703/25/simpson.newtrial/index.html">\ 
Simpson's attorneys ask for a new trial and lower damages</a> 

<li><a href="http://www.cnn.com/WORLD/9703/25/us.israel/index.html">\ 
U.S. playing low-key role in latest Mideast crisis</a> 

<li><a href="http://www.cnn.com/US/9703/25/bush.jump.ap/index.html">\ 
George Bush parachutes -- just for fun </a> 
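
Filling in the template is mostly a matter of pulling out each <clip> block, reading the matching cache file, and swapping field names for field values. The following sketch shows the idea, reusing the read_cache routine from earlier; the template and cache locations are made up, and this is an illustration rather than webpluck's actual template code:

# Expand <clip name="..."> ... </clip> blocks in a template using
# records from the cache directory.  An illustration only.
sub expand_template {
    my ($template_file, $cache_dir) = @_;

    my $text;
    {
        local $/;                            # slurp the whole template
        open(TMPL, $template_file) or die "can't open $template_file: $!";
        $text = <TMPL>;
        close(TMPL);
    }

    $text =~ s{<clip\s+name="([^"]+)">(.*?)</clip>}{
        my ($name, $chunk) = ($1, $2);
        my $output = "";
        foreach my $record (read_cache("$cache_dir/$name")) {
            my $copy = $chunk;
            # swap each field name (title, url, ...) for its value
            $copy =~ s/\b(\w+)\b/exists $record->{$1} ? $record->{$1} : $1/ge;
            $output .= $copy;
        }
        $output;
    }gise;

    return $text;
}

print expand_template("daily.tmpl", "$ENV{'HOME'}/.webpluck-cache");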

I personally use webpluck by running one cron job every morning and one during lunch to re-create my "daily" page. I realize webpluck could be used for a lot more than this; that's left as an exercise for the reader.
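
For reference, those two runs amount to nothing more than a pair of crontab entries along these lines (the path to webpluck is made up for the example; any options you use would go on the same lines):

# rebuild the daily page at 6:30 in the morning and again at noon
30 6  * * *   /usr/local/bin/webpluck
0  12 * * *   /usr/local/bin/webpluck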

How Webpluck Works

Now on to the technical goodies. For those who don't know what the LWP library is - learn! LWP is a great collection of Perl objects that allows you to fetch documents from the web. What the CGI library does for people writing web server code, LWP does for people writing web client code. You can find out more about LWP at http://www.sn.no/libwww-perl/, and you can download LWP from the CPAN at http://www.perl.com/CPAN/modules/by-module/LWP/.

webpluck is a simple program. Most of the code takes care of processing command line arguments, reading the configuration file, and checking for errors. The guts rely on the LWP library and Perl's powerful regular expressions. The following is part of the main loop in webpluck; I've removed some error checking to keep it short, but the heart of it is shown below.

use LWP; 
use URI::URL; 

$req = HTTP::Request->new( GET => $self->{'url'} ); 
$req->header( Accept => "text/html, */*;q=0.1" ); 
$res = $main::ua->request( $req );   # $main::ua is an LWP::UserAgent created elsewhere 

if ($res->is_success()) { 
   my (@fields)  = split( ':', $self->{'fields'} ); 
   my $content   = $res->content(); 
   my $regex     = $self->{'regex'}; 

   while ($content =~ /$regex/isg) { 
     my @values   = ($1,$2,$3,$4,$5,$6,$7,$8); 
     my @datalist = ();                # one record per match 

     # URLs are special fields; they might be 
     # relative, so we check for that 

     for (my $i = 0; $i <= $#fields; $i++) { 

       if ($fields[$i] eq "url") { 
          my $urlobj = new URI::URL($values[$i], $self->{'url'}); 
          $values[$i] = $urlobj->abs()->as_string(); 
       } 
       push(@datalist, $fields[$i] . ":" . $values[$i]); 
     } 
     push( @{$self->{'_data'}}, \@datalist ); 
  } 
} 

The use LWP statement imports the LWP module, which takes care of all the web-related tasks (fetching documents, parsing URLs, and parsing robot rules). The three lines that build and send the request are all it takes to grab a web page using LWP.

Assuming webpluck's attempt to retrieve the page is successful, it saves the document as one long string. It then iterates over the string, trying to match the regular expression defined for this target. The following statement merits some scrutiny:

while ( $content =~ /$regex/isg ) { 

The /i modifier makes the match case-insensitive. The /s modifier treats the document as a single line, so the . metacharacter matches newlines too and your regular expression can span multiple lines. The /g modifier walks through the entire document, grabbing data each time the regular expression matches instead of stopping at the first match.

For each match webpluck finds, it examines the fields defined by the user. If one of the fields is url (the only field that isn't arbitrary), it's turned into an absolute URL - specifically, a URI::URL object. I let that object translate itself from a relative URL to an absolute URL that can be used outside of the web site from where it was retrieved. This is the only data from the target page that gets massaged.
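
To make that step concrete, here is URI::URL doing the relative-to-absolute conversion on its own (the relative path is invented for the example):

use URI::URL;

# A relative href plucked from a page, plus the page it came from:
my $urlobj = new URI::URL('/US/9703/25/okc/index.html',
                          'http://www.cnn.com/US/');
print $urlobj->abs()->as_string(), "\n";
# prints: http://www.cnn.com/US/9703/25/okc/index.html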

Lastly I take the field names and the data that corresponds to each field and save that information. Once all the data from each matched regular expression is collected, it's run through some additional error checking and saved to a local file.

The Dark Side Of The Force

Like any tool, webpluck has both good and bad uses. The program is a sort of web robot, which raises some concerns for me (as the author) and for its users. A detailed list of these considerations can be found on The Web Robots Page at http://info.webcrawler.com/mak/projects/robots/robots.html. A few points from the Web Robot Guide to Etiquette stand out:

Identify yourself. webpluck does identify itself as "webpluck/2.0" to the remote web server. This isn't a problem now since few people use webpluck at the moment, but it could be if sites decide to block my program.

Don't overload a site. Since webpluck only checks a finite set of web pages that you explicitly define (it doesn't tree-walk sites), this isn't a problem. Just to be safe, webpluck pauses for a small time period between retrieving documents. It should only be run once or twice a day - don't launch it every five minutes to ensure that you constantly have the latest and greatest information. If instant notification is critical to you, consider push technologies like Pointcast or Marimba's Castanet.

Obey robot exclusion rules found in /robots.txt. This is the toughest rule to follow. Since webpluck is technically a robot, I should follow the rules set forth in each site's /robots.txt file. However, the data I'm after typically changes every day, and those are often exactly the pages that sites have set up rules telling robots not to index.

In my opinion, webpluck isn't your typical robot. I consider it more like your average web client. I'm not building an index, which I think is the reason that these sites tell robots not to retrieve the pages. If webpluck followed the letter of the law, it wouldn't be very useful since it wouldn't be able to access many pages that change their content. For example, CNN has this in their robot rules file:

     User-agent: * 
     Disallow: / 

If webpluck were law-abiding, it wouldn't be able to retrieve any information from CNN, one of the main sites I check for news. So what to do? After reading the Robot Exclusion Standard (http://info.webcrawler.com/mak/projects/robots/norobots.html),
I believe webpluck doesn't cause any of the problems meant to be prevented by the standard. Your interpretation may differ; I encourage you to read it and decide for yourself. webpluck has two options (--naughty and --nice) that instruct it whether to obey the robot exclusion rules found on remote servers. This is my way of deferring the decision to you.
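
For the curious, here is roughly what the etiquette points above look like in LWP terms: identify yourself, pause between fetches, and (when being nice) check the robot rules before retrieving a page. The allowed subroutine, the $main::opt_nice flag, and the five-second pause are my own names and choices for this sketch, not webpluck's internals.

use strict;
use LWP::UserAgent;
use HTTP::Request;
use WWW::RobotRules;

# Identify ourselves honestly to remote servers.
$main::ua = new LWP::UserAgent;
$main::ua->agent("webpluck/2.0");

my $rules = new WWW::RobotRules("webpluck/2.0");

sub allowed {
    my ($url) = @_;
    return 1 unless $main::opt_nice;        # --naughty: skip the check
    # A real program would fetch each site's /robots.txt only once.
    my ($site) = $url =~ m|^(http://[^/]+)|i;
    my $req = HTTP::Request->new( GET => "$site/robots.txt" );
    my $res = $main::ua->request( $req );
    $rules->parse( "$site/robots.txt", $res->content() ) if $res->is_success();
    return $rules->allowed($url);
}

my @targets;    # filled in from the configuration file, as sketched earlier

foreach my $target (@targets) {
    next unless allowed( $target->{'url'} );
    # ... fetch and pluck the page as shown earlier ...
    sleep(5);                               # don't overload anyone's server
}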

Just playing nice as a web robot is only part of the equation. Another consideration is what you do with the data once you get it. There are obvious copyright considerations. Copyright on the web is a broad issue; I'm just going to mention a few quandaries raised by webpluck. I don't have the answers.

Is it okay to extract the URL from the Cool Site of the Day home page and jump straight to the "Cool Site"? The Cool Site folks don't own the URL, but they would certainly prefer that you visit their site first.

Is it okay to retrieve headlines from CNN? What about URLs for the articles?

How about grabbing the actual articles from the CNN site and redisplaying them with your own layout?

And for all of these tasks, does it matter whether the result is just for your own personal use, shown to a friend, or redistributed more widely?

One newspaper in the Shetland Islands has sued a rival newspaper for setting up links to their web site. The May 1997 issue of WIRED lists another site being sued, called TotalNews (http://www.totalnews.com/). Some of TotalNews' web pages point to news stories written by other news providers - but TotalNews sells its own ads in a frame around them.

Obviously, people have different opinions of what is right and what is wrong. I personally don't have the background, knowledge, or desire to try to tell you what to do. I merely want to raise the issues so you can think about them and make your own decisions.

For a final example of a potential problem, let's take a look at Dilbert. Here's the target I have defined for Dilbert at the time of this writing.

name     dilbert 
url      http://www.unitedmedia.com/comics/dilbert/ 
regex    SRC=\"?([^>]?\/comics\/dilbert\/archive.*?\.gif)\"?\s+
fields   url 

The cartoon on the Dilbert page changes every day, and instead of just having a link to the latest cartoon (todays-dilbert.gif), they generate a new URL every day and include the cartoon in their web page. They do this because they don't want people setting up links directly to the cartoon. They want people to read their main page - after all, that's where the advertising is. Every morning I find out where today's Dilbert cartoon is located, bypassing all of United Media's advertising. If enough people do this, United Media will probably initiate countermeasures; there are at least three things they could do that would prevent webpluck (as it currently works) from going directly to today's comic.

Most funding for web technology exists to solve the needs of content providers, not users. If tools like webpluck are considered a serious problem by content providers, steps will be taken to shut them down, or make them less useful.

It isn't my intent to distribute a tool for filtering web advertising or stealing information from web pages so that I can redistribute it myself, but I'm not so naïve as to think that this can't be done. Obviously, anyone intent on doing these things can do so; webpluck just makes it easier. Do what you think is right.

You can find more information about webpluck at http://strobe.weeg.uiowa.edu/~edhill/public/webpluck/. The program is also on the TPJ web site.

__END__


Ed Hill (ed-hill@uiowa.edu) is a Unix Systems Administrator at the University of Iowa.