Scanning HTML - The Perl Journal, Fall 2000

Sean M. Burke

Packages Used
HTML::TreeBuilder..........................................................................CPAN HTML::Parser..................................................................................CPAN HTML::Element................................................................................CPAN

In TPJ #17, Ken MacFarlane's article Parsing HTML with HTML::Parser describes how the HTML::Parser module scans HTML source as a stream of start tags, end tags, text, comments, and so on. In TPJ #18, my Trees article kicked around the idea of tree-shaped data structures. Now I'll tie it together by discussing trees of HTML.

The CPAN module HTML::TreeBuilder takes the tags that HTML::Parser extracts, and builds a parse tree -- a tree-shaped network of objects representing the structured content of an HTML document.(And if you need a quick explanation of objects, see my TPJ #17 article A User's View of Object-Oriented Modules, or go whole hog and get Damian Conway’s excellent book Object-Oriented Perl, from Manning Publications.) Once the document is parsed as a tree, you'll find the common tasks of extracting data from that HTML document/tree to be quite straightforward.

HTML::Parser, HTML::TreeBuilder, and HTML::Element

HTML::TreeBuilder can construct a parse tree out of an HTML source file simply by saying:

  use HTML::TreeBuilder;
  my $tree = HTML::TreeBuilder->new();
  $tree->parse_file('foo.html');

$tree now contains a parse tree built from the HTML in foo.html. The parse tree is represented as a network of objects -- $tree is the root, an element with tag name html. Its children typically include head and body elements, and so on. Each element in the tree is an object of the class HTML::Element.

So, if you take this source:

  <html><head><title>Doc 1</title></head>
  <body>
  Stuff <hr> 2000-08-17
  </body></html>

and feed it to HTML::TreeBuilder, it'll return a tree of objects that looks like this:

               html
             /      \
         head        body
        /          /  |  \
     title    "Stuff"  hr  "2000-08-17"
       |
    "Doc 1"

This is a pretty simple document. If it were any more complex, it'd be a bit hard to draw in that style, since it sprawls left and right. The same tree can be represented a bit more easily sideways, with indenting:


  • html
     • head
        • title
           • "Doc 1"
     • body
        • "Stuff"
        • hr
        • "2000-08-17"

Both representations express the same structure. The root node is an object of the class HTML::Element,(Actually, the root is of the class HTML::TreeBuilder, but that's just a subclass of HTML::Element, plus a few extra methods like parse_file.) with the tag name html, and with two children: an HTML::Element object whose tag names are head and body. And each of those elements have children, and so on down. Not all elements have children -- the hr element doesn't, for instance. And not all nodes in the tree are elements -- the text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.

Objects of the class HTML::Element have three noteworthy attributes:

_tag, best accessed as $element->tag. The element's tag name, lowercased (e.g., em for an EM element).(Yes, this is misnamed. In proper SGML lingo, this is instead called a GI (short for "generic identifier") and the term "tag" is used for a token of SGML source that represents either the start of an element (a start tag like "<em lang='fr'>") or the end of an element (an end tag like "</em>"). However, since more people claim to have been abducted by aliens than to have ever seen the SGML standard, and since both encounters typically involve a feeling of "missing time", it's not surprising that the terminology of the SGML standard is not closely followed.)
_parent, best accessed as $element->parent. The element that is the element's parent, or undef if this element is the root.
_content, best accessed as $element->content_list. The list of nodes (i.e., elements or text segments) that are the element's children.

Moreover, if an element has any attributes, those are readable as $element->attr('name') -- for example, with the object built from <a id='foo'>bar</a>, the method call $element->attr('id') returns the string foo. Furthermore, $element->tag on that object returns the string a, $element->content_list returns a list consisting of just the single scalar bar, and $element->parent method returns the parent of this node -- which might be, for example, a <p> element.

And that's all that there is to it -- you throw HTML source at TreeBuilder, and it returns a tree of HTML::Element objects and some text strings.

However, what do you do with a tree of objects? People code information into HTML trees not for the fun of arranging elements, but to represent the structure of specific text and images -- some text is in this li element, some other text is in that heading, some images are in this table cell with those attributes, and so on.

Now, it may happen that you're rendering that whole HTML tree into some layout format. Or you could be trying to make some systematic change to the HTML tree before dumping it out as HTML source again. But in my experience, the most common programming task that Perl programmers face with HTML is trying to extract some piece of information from a larger document. Since that's so common (and also since it involves concepts required for more complex tasks), that is what the rest of this article will be about.

Scanning HTML trees

Suppose you have a thousand HTML documents, each of them a press release. They all start out:

[...lots of leading images and junk...]
<h1>ConGlomCo to Open New Corporate Office in Ougadougou</h1>
BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in
charge of world conquest, Rock Feldspar, announced today the
opening of a new office in Ougadougou, the capital city of
Burkina Faso, gateway to the bustling "Silicon Sahara" of
Africa...
[...etc...]

What you've got to do is: For each document, copy whatever text is in the h1 element, so that you can make a table of its contents. Now, there are three ways to do this:

• You can just use a regex to scan the file for a text pattern.

For simple tasks, this will be fine. Many HTML documents are, in practice, very consistently formatted with respect to placement of linebreaks and whitespace, so you could just get away with scanning the file like so:
  sub get_heading {
      my $filename = $_[0];
      local *HTML;
      open(HTML, $filename)
        or die "Couldn't open $filename);
      my $heading;
    Line:
      while (<HTML>) {
          if( m{<h1>(.*?)</h1>}i ) {
              $heading = $1;
              last Line;
          }
      }
      close(HTML);
      warn "No heading in $filename?"
        unless defined $heading;
      return $heading;
  }
This is quick, fast, and fragile -- if there's a newline in the middle of a heading's text, it won't match the above regex, and you'll get an error. The regex will also fail if the h1 element's start tag has any attributes. If you have to adapt your code to fit more kinds of start tags, you'll end up basically reinventing part of HTML::Parser, at which point you should probably just stop and use HTML::Parser itself.

• You can use HTML::Parser to scan the file for an h1 start tag token and capture all the text tokens until the h1 end tag. This approach is extensively covered in Ken MacFarlane's TPJ #17 article Parsing HTML with HTML::Parser. (A variant of this approach is to use HTML::TokeParser, which presents a different and handier interface to the tokens that HTML::Parser extracts.)

Using HTML::Parser is less fragile than our first approach, since it is insensitive to the exact internal formatting of the start tag (much less whether it's split across two lines). However, when you need more information about the context of the h1 element, or if you're having to deal with tricky HTML bits like tables, you'll find that the flat list of tokens returned by HTML::Parser isn't immediately useful. To get something useful out of those tokens, you'll need to write code that knows which elements take no content (as with hr elements), and that </p> end tags are optional, so a <p> ends any currently open paragraph. You're well on your way to pointlessly reinventing much of the code in HTML::TreeBuilder,(And, as the person who last rewrote that module, I can attest that it wasn't terribly easy to get right! Never underestimate the perversity of people creating HTML.) at which point you should probably just stop and use HTML::TreeBuilder itself.

• You can use HTML::Treebuilder and scan the tree of elements it creates.

The last approach, using HTML::TreeBuilder, is diametrically opposed to the first approach, which involves just elementary Perl and one regex. The TreeBuilder approach involves being comfortable with the concept of tree-shaped data structures and modules with object-oriented interfaces, as well as with the particular interfaces that HTML::TreeBuilder and HTML::Element provide.

However, the TreeBuilder approach is the most robust, because it involves dealing with HTML in its "native" format -- the tree structure that HTML code represents, without any consideration of how the source is coded and with what tags are omitted.

So, to extract the text from the h1 elements of an HTML document:

  sub get_heading {
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_file($_[0]);
      my $heading;
      my $h1 = $tree->look_down('_tag', 'h1');
      if ($h1) {
          $heading = $h1->as_text;
      } else {
          warn "No heading in $_[0]?";
      }
      $tree->delete;     # clear memory
      return $heading;
  }

This uses some unfamiliar methods. The parse_file method that we've seen before builds a tree based on source from the file given. The delete method is for marking a tree's contents as available for garbage collection when you're done. The as_text method returns a string that contains all the text bits that are children (or otherwise descendants) of the given node -- to get the text content of the $h1 object, we could just say:

  $heading = join '', $h1->content_list;

but that will work only if we're sure that the h1 element's children will be only text bits. If the document contained:

  <h1>Local Man Sees <cite>Blade</cite> Again</h1>

then the sub-tree would be:

  • h1
    • "Local Man Sees "
    • cite
      • "Blade"
    • " Again'

so join '', $h1->content_list will result in something like this:

  Local Man Sees HTML::Element=HASH(0x15424040) Again

Meanwhile, $h1->as_text would yield:

  Local Man Sees Blade Again

Depending on what you're doing with the heading text, you might want the as_HTML method instead. It returns the sub-tree represented as HTML source. $h1->as_HTML would yield:

  <h1>Local Man Sees <cite>Blade</cite> Again</h1>

However, if you wanted the contents of $h1 as HTML, but not the $h1 itself, you could say:

  join '',
    map(
      ref($_) ? $_->as_HTML : $_,
      $h1->content_list
    )

This map iterates over the nodes in $h1's list of children, and for each node that's just a text bit (as "Local Man Sees " is), it just passes through that string value, and for each node that's an actual object (causing ref to be true), as_HTML will be used instead of the string value of the object itself (which would be something quite useless, as most object values are). So for the cite element, as_HTML will be the string <cite>Blade</cite>. And then, finally, join just combines all the strings that the map returns into one string.

Finally, the most important method in our get_heading subroutine is the look_down method. This method looks down at the sub-tree starting at the given object (here, $h1), retrieving elements that meet criteria you provide.

The criteria are specified in the method's argument list. Each criterion consists of two scalars: a key and a value expressing an element and attribute. The key might be _tag or src, and the value might be an attribute like h1. Or, the criterion can be a reference to a subroutine that, when called on an element, returns true if it's a node you're looking for. If you specify several criteria, that means you want all the elements that satisfy all the criteria. (In other words, there's an implicit "and.")

And finally, there's a bit of an optimization -- if you call the look_down method in a scalar context, you get just the first node (or undef if none) -- and, in fact, once look_down finds that first matching element, it doesn't bother looking any further.

So the example:

  $h1 = $tree->look_down('_tag', 'h1');

returns the first element at or under $tree whose "_tag" attribute has the value h1.

Complex Criteria in Tree Scanning

Now, the above look_down code looks like a lot of bother, with barely more benefit than just grepping the file! But consider a situation in which your criteria are more complicated -- suppose you found that some of your press releases had several h1 elements, possibly before or after the one you actually want. For example:

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru">
       <img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo President Schreck to Visit Regional HQ
   <br><a href="/photos/Schreck_visit_large.jpg">
       <img src="/photos/Schreck_visit.jpg"></a>
  </center></h1>

Here, you want to ignore the first h1 element because it contains an ad, and you want the text from the second h1. The problem is how to formalize what's an ad and what's not. Since ad banners are always entreating you to "visit" the sponsoring site, you could exclude h1 elements that contain the word "visit" under them:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text !~ m/\bvisit/i
    }
  );

The first criterion looks for h1 elements, and the second criterion limits those to only the ones whose text doesn't match m/\bvisit/. Unfortunately, that won't work for our example, since the second h1 mentions "ConGlomCo President Schreck to Visit Regional HQ".

Instead, you could try looking for the first h1 element that doesn't contain an image:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      not $_[0]->look_down('_tag', 'img')
    }
  );

This criterion subroutine might seem a bit odd, since it calls look_down as part of a larger look_down operation, but that's fine. Note if there's no matching element at or under the given element, look_down returns false (specifically, undef) in a boolean context. If there are matching elements, it returns the first. So this means "return true only if this element has no img element as descendants and isn't an img element itself."

  sub {
    not $_[0]->look_down('_tag', 'img')
  }

This correctly filters out the first h1 that contains the ad, but it also incorrectly filters out the second h1 that contains a non-advertisement photo near the headline text you want.

There clearly are detectable differences between the first and second h1 elements -- the only second one contains the string "Schreck", and we can just test for that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text =~ m{Schreck}
    }
  );

And that works fine for this one example, but unless all thousand of your press releases have "Schreck" in the headline, it's not generic enough. However, if all the ads in h1s involve a link whose URL involves "/dyna/", you can use that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $link = $_[0]->look_down('_tag','a');
      # no link means it's fine
      return 1 unless $link;

      # a link to there is bad
      return 0 if $link->attr('href') =~ m{/dyna/};

      return 1;   # otherwise okay
    }
  );

Or you can look at it another way, and say that you want the first h1 element that either contains no images, or else whose image has a src attribute whose value contains "/photos/":

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag','img');

      # no image means it's fine
      return 1 unless $img;

      # good if a photo
      return 1 if $img->attr('src') =~ m{/photos/};

      return 0; # otherwise bad
    }
  );

Recall that this use of look_down in a scalar context returns the first element at or under $tree matching all the criteria. But if you can formulate criteria that match several possible h1 elements, with the last one being the one you want, you can use look_down in a list context, and ignore all but the last element of the returned list:

  my @h1s = $tree->look_down(
    '_tag', 'h1',
    ...maybe more criteria...
  );
  die "What, no h1s here?" unless @h1s;
  my $real_h1 = $h1s[-1];  # last or only element

A Case Study: Scanning Yahoo! News

The above (somewhat contrived) case involves extracting data from a bunch of pre-existing HTML files. In such situations, it's easy to know when your code works, since the data it handles won't change or grow, and you typically need to run the program only once.

The other kind of situation faced in many data extraction tasks is where the program is used recurringly to handle new data -- such as from ever-changing Web pages. As a real-world example of this, consider a program that you could use to extract headline links from subsections of Yahoo! News (http://dailynews.yahoo.com/). Yahoo! News has several subsections, such as:

• http://dailynews.yahoo.com/h/tc/ for technology news
• http://dailynews.yahoo.com/h/sc/ for science news
• http://dailynews.yahoo.com/h/hl/ for health news
• http://dailynews.yahoo.com/h/wl/ for world news
• http://dailynews.yahoo.com/h/en/ for entertainment news

All of them are built on the same basic HTML template -- and a scarily complicated template it is, especially when you look at it with an eye toward identifying the real headline links and screening out the links to everything else. You'll need to puzzle over the HTML source, and scrutinize the output of $tree->dump on the parse tree of that HTML.

Sometimes the only way to pin down what you're after is by position in the tree. For example, headlines of interest may be in the third column of the second row of the second table element in a page:

  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look-down('_tag', 'td')   )[2];
  ...then do things with $col3...

Or they might be all the links in a <p> element with more than two <br> elements as children:

  my $p = $tree->look_down(
    '_tag', 'p',
    sub {
      2 > grep { ref($_) and $_->tag eq 'br' }
              $_[0]->content_list
    }
  );
  @links = $p->look_down('_tag', 'a');

But almost always, you can get away with looking for properties of the thing itself, rather than just looking for contexts. If you're lucky, the document you're looking through has clear semantic tagging, perhaps tailored for CSS (Cascading Style Sheets):

  <a href="...long_news_url..." class="headlinelink">Elvis
  seen in tortilla</a>

If you find anything like that, you could leap right in and select links with:

  @links = $tree->look_down('class', 'headlinelink');

Regrettably, your chances of observing such semantic markup principles in real-life HTML are pretty slim.(In fact, your chances of finding a page that is simply free of HTML errors are even slimmer. And surprisingly, sites like Amazon or Yahoo! are typically worse as far as quality of code than personal sites whose entire production cycle involves simply being saved and uploaded from Netscape Composer.)

The code may be "accidentally semantic", however -- for example, in a set of pages I was scanning recently, I found that looking for td elements with a width attribute value of 375 got me exactly what I wanted. No one designing that page ever conceived of width=375 as meaning "this is a headline", but if you take it to mean that, it works.

An approach like this happens to work for the Yahoo! News code, because the headline links are distinguished by the fact that they (and they alone) contain a b element:

  <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>

Or, diagrammed as a part of the parse tree:

  • a  [href="...long_news_url..."]
    • b
      • "Elvis seen in tortilla"

A rule that matches these can be formalized as "look for any a element that has only one daughter node, which must be a b element". And this is what it looks like when cooked up as a look_down expression and prefaced with a bit of code to retrieve the Yahoo! News page and feed it to TreeBuilder:

use strict;
use HTML::TreeBuilder 2.97;
use LWP::UserAgent;
sub get_headlines {
    my $url = $_[0] || die "What URL?";
    
    my $response = LWP::UserAgent->new->request(
      HTTP::Request->new( GET => $url )
    );
  
    unless ($response->is_success) {
      warn "Couldn't get $url: ", 
$response->status_line, "\n";
      return;
    }
    
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($response->content);
    $tree->eof;
    
    my @out;
    foreach my $link (
      $tree->look_down(
        '_tag', 'a',
        sub {
          return unless $_[0]->attr('href');
          my @c = $_[0]->content_list;
          @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
        }
     )
    ) {
        push @out, [$link->attr('href'),$link->as_text ];
    }
    
    warn "Odd, fewer than 6 stories in $url!" if @out < 6;
    $tree->delete;
    return @out;
}

And we add a bit of code to call get_headlines and display the results:

foreach my $section (qw[tc sc hl wl en]) {
    my @links = get_headlines(
      "http://dailynews.yahoo.com/h/$section/"
    );
    print
      $section, ": ", scalar(@links), " stories\n",
      map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
      "\n";
}

Now we have our own headline extractor service! By itself, it isn't so amazingly useful (since if you want to see the headlines, you can just look at the Yahoo! News pages), but it could easily be the basis for features like filtering the headlines for particular topics of interest.

One of these days, Yahoo! News will change its HTML template. When this happens, it will appear to the above program as though there are no links meeting our criteria; or, less likely, dozens of erroneous links will meet the criteria. In either case, the criteria will have to be changed for the new template; they may just need adjustment, or you may need to scrap them and start over.

Regardez, duvet!

It's often a challenge to write criteria that match the desired parts of an HTML parse tree. Very often you can pull it off with a simple $tree->look_down('_tag', 'h1'), but sometimes you have to keep adding and refining criteria, until you end up with complex filters like what I've shown in this article. The benefit of HTML parse trees is that one main search tool, the look_down method, can do most of the work, making simple things easy while keeping hard things possible.

_ _END_ _

Sean M. Burke (sburke@cpan.org) is the current maintainer of HTML::TreeBuilder and HTML::Element, both originally by Gisle Aas. He'd like to thank the folks who listened to him ramble incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet Another Perl Conference and O'Reilly Open Source Software Convention.

TABLE OF CONTENTS