Parsing HTML with HTML::Parser - The Perl Journal, Spring 2000

Ken MacFarlane

Packages Used:

HTML::Parser...............................CPAN

Perl is often used to manipulate the HTML files constituting web pages. For instance, one common task is removing tags from an HTML file to extract the plain text. Solutions for such tasks often rely on regular expressions, which tend to end up complicated, unattractive, and incomplete (or wrong). The alternative, described here, is to use the HTML::Parser module available on the CPAN (http://www.perl.com/CPAN). HTML::Parser is an excellent example of what Sean Burke noted earlier in this issue: some object-oriented modules require extra explanation for casual users.

HTML::Parser works by scanning HTML input and breaking it up into segments according to how a browser would interpret the text. For instance, this input:

<A HREF="index.html">This is a link</A>

would be broken up into three segments: a start tag (<A HREF="index.html">), text (This is a link), and an end tag (</A>). As each segment is detected, the parser passes it to an appropriate subroutine. There's a subroutine for start tags, one for end tags, and another for plain text. There are subroutines for comments and declarations as well.
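To see this segmentation for yourself, here is a minimal sketch that uses the subclassing approach explained later in this article. The package name ShowEvents and the @events array are made up for illustration; only the start(), text(), and end() callbacks come from HTML::Parser.

```perl
#!/usr/bin/perl -w

use strict;

# ShowEvents is a hypothetical name chosen for this sketch
package ShowEvents;
use base "HTML::Parser";

# one [type, original text] pair per callback, in the order received
my @events;

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @events, ["start", $origtext];
}

sub text {
    my ($self, $text) = @_;
    push @events, ["text", $text];
}

sub end {
    my ($self, $tag, $origtext) = @_;
    push @events, ["end", $origtext];
}

my $p = ShowEvents->new;
$p->parse('<A HREF="index.html">This is a link</A>');
$p->eof;

# print each segment with the kind of event that reported it
printf "%-5s %s\n", @$_ for @events;
```

Running it should list one start event, the link text, and one end event. The parser is free to deliver a run of text in more than one piece, so a program should not count on exactly one text event per run.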

In this article, I'll first give a simple example of how to read and print out all the information found by HTML::Parser. Next, I'll demonstrate differences in the events triggered by the parser. Finally, I'll show how to access specific information passed along by the parser.

As of this writing, there are two major versions of HTML::Parser available. Both version 2 and version 3 work by having you subclass the module. For this article, I will mostly concentrate on the subclassing method, because it will work with both major versions, and is a bit easier to understand for those not overly familiar with some of Perl's finer details. In version 3, there is more of an emphasis on the use of references, anonymous subroutines, and similar topics; advanced users who may be interested will see a brief example at the end of this article.

Getting Started

The first thing to be aware of when using HTML::Parser is that, unlike other modules, it appears to do absolutely nothing. When I first attempted to use this module, I used code similar to this:

    #!/usr/bin/perl -w

    use strict;
    use HTML::Parser;

    my $p = new HTML::Parser;
    $p->parse_file("index.html");

No output whatsoever. If you look at the source code to the module, you'll see why:

    sub text
    {
        # my($self, $text) = @_;
    }

    sub declaration
    {
        # my($self, $decl) = @_;
    }

    sub comment
    {
        # my($self, $comment) = @_;
    }

    sub start
    {
        # my($self, $tag, $attr, $attrseq, $origtext) = @_;
        # $attr is reference to a HASH, $attrseq is reference to an ARRAY
    }

    sub end
    {
        # my($self, $tag, $origtext) = @_;
    }

The whole idea of the parser is that as it chugs along through the HTML, it calls these subroutines whenever it finds an appropriate snippet (start tag, end tag, and so on). However, these subroutines do nothing. My program works, and the HTML is being parsed — but I never instructed the program to do anything with the parse results.

The Identity Parser

The following is an example of how HTML::Parser can be subclassed, and its methods overridden, to produce meaningful output. This example simply prints out the original HTML file, unmodified:

     1  #!/usr/bin/perl -w
     2  
     3  use strict;
     4  
     5  # define the subclass
     6  package IdentityParse;
     7  use base "HTML::Parser";
     8  
     9  sub text {
    10      my ($self, $text) = @_;
    11      # just print out the original text
    12      print $text;
    13  }
    14  
    15  sub comment {
    16      my ($self, $comment) = @_;
    17      # print out original text with comment marker
    18      print "<!--", $comment, "-->";
    19  }
    20  
    21  sub start {
    22      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    23      # print out original text
    24      print $origtext;
    25  }
    26  
    27  sub end {
    28      my ($self, $tag, $origtext) = @_;
    29      # print out original text
    30      print $origtext;
    31  }
    32  
    33  my $p = new IdentityParse;
    34  $p->parse_file("index.html");

Lines 6 and 7 declare the IdentityParse package, having it inherit from HTML::Parser. (Type perldoc perltoot for more information on inheritance.) We then override the text(), comment(), start(), and end() subroutines so that they print their original values. The result is a script which reads an HTML file, parses it, and prints it to standard output in its original form.

The HTML Tag Stripper

Our next example strips all the tags from the HTML file and prints just the text:

     1  #!/usr/bin/perl -w
     2  
     3  use strict;
     4  
     5  package HTMLStrip;
     6  use base "HTML::Parser";
     7  
     8  sub text {
     9      my ($self, $text) = @_;
    10      print $text;
    11  }
    12  
    13  my $p = new HTMLStrip;
    14  # parse line-by-line, rather than the whole file at once
    15  while (<>) {
    16      $p->parse($_);
    17  }
    18  # flush and parse remaining unparsed HTML
    19  $p->eof;

Since we're only interested in the text and not the HTML tags, we override only the text() subroutine. Also note that in lines 13-17, we invoke the parse() method instead of parse_file(). This lets us read files provided on the command line. When using parse() instead of parse_file(), we must also call the eof() method (line 19), which tells the parser that no more input is coming so that it can process whatever is left in its internal buffer.
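The need for eof() is easier to see when the input arrives in awkward pieces. The following sketch feeds the parser three chunks whose boundaries fall in the middle of a word and in the middle of a tag. The package name CollectText, the $collected variable, and the sample chunks are all invented for this illustration.

```perl
#!/usr/bin/perl -w

use strict;

# CollectText is a hypothetical name chosen for this sketch
package CollectText;
use base "HTML::Parser";

# accumulate text here instead of printing it piece by piece
my $collected = "";

sub text {
    my ($self, $text) = @_;
    $collected .= $text;
}

my $p = CollectText->new;

# chunk boundaries deliberately split a word and an end tag
for my $chunk ('<B>Hel', 'lo</', 'B> world') {
    $p->parse($chunk);
}

# no more input: let the parser handle anything still buffered
$p->eof;

print $collected, "\n";
```

Because the parser holds on to input it cannot yet classify (such as the dangling "</" at the end of the second chunk), the tag is still recognized once the rest arrives, and only the text ends up in $collected.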

Another Example: HTML Summaries

Suppose you've hand-crafted your own search engine for your web site, and you want to be able to generate summaries for each hit. You could use the HTML::Summary module described in TPJ #16, but we'll describe a simpler solution here. We'll assume that some (but not all) of your site's pages use a META tag to describe the content:

<META NAME="DESCRIPTION" CONTENT="description of file">

When a page has a META tag, your search engine should use the CONTENT for the summary. Otherwise, the summary should be the first H1 tag if one exists. And if that fails, we'll use the TITLE. Our third example generates such a summary:

 
 1  #!/usr/bin/perl -w
 2  
 3  use strict;
 4  
 5  package GetSummary;
 6  use base "HTML::Parser";
 7  
 8  my $meta_contents;
 9  my $h1 =    "";
10  my $title = "";
11  
12  # set state flags
13  my $h1_flag    = 0;
14  my $title_flag = 0;
15  
16  sub start {
17      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
18  
19      if ($tag =~ /^meta$/i && defined $attr->{'name'} &&
            $attr->{'name'} =~ /^description$/i) {
20          # set if we find <META NAME="DESCRIPTION"
21          $meta_contents = $attr->{'content'};
22      } elsif ($tag =~ /^h1$/i && ! $h1) {
23          # set state if we find <H1> or <TITLE>
24          $h1_flag = 1;
25      } elsif ($tag =~ /^title$/i && ! $title) {
26          $title_flag = 1;
27      }
28  }
29  
30  sub text {
31      my ($self, $text) = @_;
32      # If we're in <H1>...</H1> or <TITLE>...</TITLE>, save text
33      if ($h1_flag)    { $h1    .= $text; } 
34      if ($title_flag) { $title .= $text; }
35  }
36  
37  sub end {
38      my ($self, $tag, $origtext) = @_;
39  
40      # reset appropriate flag if we see </H1> or </TITLE>
41      if ($tag =~ /^h1$/i)    { $h1_flag = 0; }
42      if ($tag =~ /^title$/i) { $title_flag = 0; }
43  }
44   
45  my $p = new GetSummary;
46  while (<>) {
47      $p->parse($_);
48  }
49  $p->eof;
50  
51  print "Summary information: ", $meta_contents || 
52      $h1 || $title || "No summary information found.", "\n";

The magic happens in lines 19-27. The variable $attr contains a reference to a hash where the tag attributes are represented with key/value pairs. The keys are lowercased by the module, which is a code-saver; otherwise, we'd need to check for all casing possibilities (name, NAME, Name, and so on).

Lines 19-21 check to see if the current tag is a META tag and has a field NAME set to DESCRIPTION; if so, the variable $meta_contents is set to the value of the CONTENT field. Lines 22-27 likewise check for an H1 or TITLE tag. In these cases, the information we want is in the text between the start and end tags, and not the tag itself. Furthermore, when the text subroutine is called, it has no way of knowing which tags (if any) its text is between. This is why we set a flag in start() (where the tag name is known) and check the flag in text() (where it isn't). Lines 22 and 25 also check whether or not $h1 and $title have been set; since we only want the first match, subsequent matches are ignored.

Another Fictional Example

Suppose that your company has been running a successful product site at http://www.bar.com/foo/. However, the web marketing team decides that http://foo.bar.com/ looks better in the company's advertising materials, so a redirect is set up from the new address to the old.

Fast forward to Friday, 4:45 in the afternoon, when the phone rings. The frantic voice on the other end says, "foo.bar.com just crashed! We need to change all the links back to the old location!" Just when you thought a simple search-and-replace would suffice, the voice adds: "And marketing says we can't change the text of the web pages, only the links."

"No problem", you respond, and quickly hack together a program that changes the links in A HREF tags, and nowhere else.

     1  #!/usr/bin/perl -w -i.bak
     2  
     3  use strict;
     4  
     5  package ChangeLinks;
     6  use base "HTML::Parser";
     7  
     8  sub start {
     9      my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    10  
    11      # we're only interested in changing <A ...> tags
    12      unless ($tag =~ /^a$/) {
    13          print $origtext;
    14          return;
    15      }
    16  
    17      if (defined $attr->{'href'}) {
    18          $attr->{'href'} =~
                    s[foo\.bar\.com][www.bar.com/foo];
    19      }
    20  
    21      print "<A ";
    22      # print each attribute of the <A ...> tag
    23      foreach my $i (@$attrseq) {
    24          print $i, qq(="$attr->{$i}" );
    25      }
    26      print ">";
    27  }
    28  
    29  sub text {
    30      my ($self, $text) = @_;
    31      print $text;
    32  }
    33  
    34  sub comment {
    35      my ($self, $comment) = @_;
    36      print "<!--", $comment, "-->";
    37  }
    38  
    39  sub end {
    40      my ($self, $tag, $origtext) = @_;
    41      print $origtext;
    42  }
    43  
    44  my $p = new ChangeLinks;
    45  while (<>) {
    46      $p->parse($_);
    47  }
    48  $p->eof;

Line 1 specifies that the files will be edited in place, with the original files being renamed with a .bak extension. The real fun is in the start() subroutine, lines 8-27. First, in lines 12-15, we check for an A tag; if that's not what we have, we simply print the original tag and return. Lines 17-19 check for the HREF and make the desired substitution.

$attrseq appears in line 23. This variable is a reference to an array with the tag attributes in their original order of appearance. If the attribute order needs to be preserved, this array is necessary to reconstruct the original order, since the hash $attr will jumble them up. Here, we dereference $attrseq and then recreate each tag. The attribute names will appear lowercase regardless of how they originally appeared. If you'd prefer uppercase, change the first $i in line 24 to uc($i).
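The interplay between the hash and the array can be tried without the parser at all. In the sketch below, $attr and $attrseq are hand-built stand-ins for what start() would receive, and rebuild_tag() is a hypothetical helper written for this illustration; it is a variation on lines 21-26 that joins the pieces with single spaces rather than printing them one at a time.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: rebuild a start tag from the attribute hash,
# using the sequence array to restore the original attribute order.
sub rebuild_tag {
    my ($tag, $attr, $attrseq) = @_;
    my @pairs = map { qq($_="$attr->{$_}") } @$attrseq;
    return "<" . join(" ", $tag, @pairs) . ">";
}

# stand-ins for the values the parser would pass to start()
my $attr = {
    href   => "http://www.bar.com/foo/",
    target => "_top",
    class  => "nav",
};
my $attrseq = [ "href", "target", "class" ];

my $rebuilt = rebuild_tag("a", $attr, $attrseq);
print $rebuilt, "\n";
```

Looping over keys %$attr instead of @$attrseq would produce the same attributes, but in whatever order the hash happens to return them.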

Using HTML::Parser Version 3

Version 3 of the module provides more flexibility in how the handlers are invoked. One big change is that you no longer have to use subclassing; rather, event handlers can be specified when the HTML::Parser constructor is called. The following example is equivalent to the previous program but uses some of the version 3 features:

     1  #!/usr/bin/perl -w -i.bak
     2  
     3  use strict;
     4  use HTML::Parser;
     5  
     6  # specify events here rather than in a subclass
     7  my $p = HTML::Parser->new( api_version => 3,
     8           start_h     => [\&start,
     9                           "tagname, attr, attrseq, text"],
    10           default_h   => [sub { print shift }, "text"],
    11                             );
    12  sub start {
    13      my ($tag, $attr, $attrseq, $origtext) = @_;
    14  
    15      unless ($tag =~ /^a$/) {
    16          print $origtext;
    17          return;
    18      }
    19  
    20      if (defined $attr->{'href'}) {
    21          $attr->{'href'} =~
                    s[foo\.bar\.com][www.bar.com/foo];
    22      }
    23  
    24      print "<A ";
    25      foreach my $i (@$attrseq) {
    26          print $i, qq(="$attr->{$i}" );
    27      }
    28      print ">";
    29  }
    30  
    31  while (<>) {
    32      $p->parse($_);
    33  }
    34  $p->eof;

The key changes are in lines 7-10. In line 8, we specify that the start event is to be handled by the start() subroutine. Another key change is line 10; version 3 of HTML::Parser supports the notion of a default handler. In the previous example, we needed to specify separate handlers for text, end tags, and comments; here, we use the default_h option as a catch-all. This turns out to be a code saver as well.

Take a closer look at line 9, and compare it to line 9 of the previous example. Note that $self hasn't been passed. In version 3 of HTML::Parser, the list of attributes which can be passed along to the handler subroutine is configurable. If our program only needed the tag name and text, we could change the string "tagname, attr, attrseq, text" to "tagname, text" and then change the start() subroutine to take only two parameters. Also, handlers are not limited to subroutines. If we changed the default handler like this, the text that would have been printed would instead be pushed onto @lines:

  my $p = HTML::Parser->new( api_version => 3,
                     start_h     => [\&start,
                     "tagname, attr, attrseq, text"],
                     default_h   => [\@lines, "text"],
                         );
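Both ideas, a shorter argument list and an array in place of a subroutine, can be combined in one small sketch. The names @tags and @other and the sample HTML are invented for this illustration. As I understand the version 3 documentation, each event sent to an array handler is stored as a reference to an array of the requested values, so the original text is the first element of each entry; treat that detail as an assumption and check it against the version you have installed.

```perl
#!/usr/bin/perl -w

use strict;
use HTML::Parser;

my @tags;     # filled by our start() subroutine
my @other;    # filled directly by the parser for every other event

sub start {
    # only the two values named in the argument string are passed
    my ($tag, $origtext) = @_;
    push @tags, $tag;
}

my $p = HTML::Parser->new( api_version => 3,
                           start_h     => [\&start, "tagname, text"],
                           default_h   => [\@other, "text"],
                         );

$p->parse('<P>Go <A HREF="index.html">home</A></P>');
$p->eof;

# assumption: each entry in @other is an array reference like [ $text ]
my $rest = join "", map { $_->[0] } @other;

print "start tags: @tags\n";
print "everything else: $rest\n";
```

Collecting events in an array like this is handy when the output has to be post-processed as a whole rather than printed as it is parsed.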

Version 3 of HTML::Parser also adds some new features; notably, one can now set options to recognize and act upon XML constructs, such as <TAG/> and <?TAG?>. There are also multiple methods of accessing tag information, instead of the $attr hash. Rather than go into further detail, I encourage you to explore the flexibility and power of this module on your own.

Acknowledgments

The HTML::Parser module was written by Gisle Aas and Michael A. Chase. Excerpts of code and documentation from the module are used here with the authors' permission.

__END__

Ken MacFarlane (ksm+tpj@universal.dca.net) is a consultant with Info Systems, Inc. in Wilmington, Delaware. When not working or doing other things Perl, Ken and his wife Amanda spend most of their time chasing around their one-year-old son, Timothy.
