CS-Web: A Lightweight Summarizer for HTML - The Perl Journal, Winter 1999

Tony Rose and Ave Wrigley

Packages Used

HTML::Summary                                  CPAN
HTML::TreeBuilder                              CPAN
Text::Sentence.          Part of HTML::Summary

Canon, like many other large companies, is a multi-national organization with multiple (26) web sites, each managed by a different part of the company. This is a problem for the typical Canon customer, who knows nothing about Canon's internal organization and simply wants to find information about their cameras or download a new printer driver. They need a single clear way to find what they want.

CS-Web: a search engine for Canon's web space

In 1997 we wrote CS-Web, a set of Perl programs to collect information from all of Canon's web sites, index it, and make it searchable from the web. We wrote our own solution because at the time the available products were either tools designed for searching the entire web (such as AltaVista), or for searching a single web site.

CS-Web consists of a robot, a database, and a web interface (written in mod_perl). The robot traverses all of Canon web sites and stores a description of each page in the database. The search engine queries the database and gives you a list of candidate documents, and their descriptions. You can try CS-Web for yourself: it is linked from the main "gateway" page for Canon (http://www.canon.com/). Alternatively, you can access it directly at http://csweb.cre.canon.co.uk/.

CS-Web presented a variety of challenges, many of which would make suitable war stories for TPJ. However, for this article, we will focus on one crucial problem: generating the summary of an HTML document.

META tags

Unlike some other search engines, CS-Web doesn't index the full text of the document. Instead, it indexes document descriptions. When web page authors use the META tag's NAME and CONTENT attributes to encode information about the document, CS-Web will use it. However, when that information isn't provided, CS-Web tries to boil down the text of the document into a summary.

One important limitation on a document's summary is length: each web page corresponds to a row in a relational database table, and the size of each field in each row is fixed in advance. This is because fixed-width fields are much quicker to search than variable-width fields. You'll see later how this length constraint introduced its own problems.

If we were deploying CS-Web across a lot of web sites, we'd have quickly found that very few web page authors consistently provide accurate metadata. In fact, deliberately misleading metadata is often included by unscrupulous authors to enhance the page's prominence in search engine results.

However, the outlook for CS-Web was a little more promising. Since Canon's webmasters are generally working together, we could expect a certain level of integrity, and assume that they were not trying to deceive the CS-Web robot. In turn, the CS-Web robot could acknowledge this trust: if it found a page description within a META tag, it would accept the tag as legitimate.

So much for metadata, but how could we generate a text description from the raw HTML when no metadata are present? By encoding some natural language processing techniques in Perl. The result was HTML::Summary.

HTML::Summary

HTML::Summary is available from the CPAN (ftp://ftp.perl.org/pub/CPAN/authors/id/A/AW/AWRIGLEY/); version 0.016 is described in this article. First, you create an HTML::Summary object, using the new() method. You can provide configuration parameters as arguments to new():

    my $html_summarizer = new HTML::Summary LENGTH => 200;

The LENGTH parameter is the maximum length in bytes for the generated summary. Next, you need an HTML::Element object corresponding to the HTML page that you want to summarize. You can generate one with HTML::TreeBuilder:

    my $html_tree = new HTML::TreeBuilder;
    $html_tree->parse( $html_document );

$html_document is a string containing the HTML of the web page; this could have been read in from a file, or returned as the contents of an HTTP request, such as through LWP::Simple's get() method.

Finally, you call the generate() method of the HTML::Summary object, with the HTML::Element object as an argument. That returns the summary of the page as a string:

    $html_summary = $html_summarizer->generate( $html_tree );

That's how you use it. But how does it work?

The Summarization Algorithm

One of the main tasks before us was generating a good abstract of fixed length from arbitrary text. This is known to be a important and difficult problem, and a quality solution requires sophisticated natural language techniques that can analyze the structure of the original text, identify key phrases and concepts, and regenerate these in a more succinct format.

Luckily for us, there are some quick and dirty ways to generate summaries. We only needed to provide a gist of the original for someone browsing the CS-Web search results. In addition, for retrieval purposes, we wanted the summary to contain representative keywords.

One advantage that we had over people trying to generate summaries from plain text is that HTML pages contain markup information --the HTML tags. Markup tells us about the structure of the content, and often about its relative importance as well. For example, it is usually clear in HTML pages where paragraphs begin and end, and when important text is italicized, emboldened, or made into a heading.

The HTML::Summary module uses a technique known as the location method of text summarization. This consists of identifying important sentences (based primarily on their location in the text), and concatenating them together to produce an abstract. A simple example of this would be to take the first sentence of every paragraph in an article and string them together. This can sometimes be surprisingly effective:

   Canon, like many other large companies, is a multi-national
   organization with multiple (26) web sites, each managed by a 
   different part of the company. In 1997 we wrote CS-Web, a 
   set of Perl programs to collect information from all of 
   Canon's web sites, index it, and make it searchable from the 
   web. CS-Web consists of a robot, a database, and a web 
   interface (written in mod_perl). CS-Web presented a variety 
   of challenges, many of which would make suitable war stories 
   for TPJ.

The text summarization method used in HTML::Summary is an adaptation of the location method. It works as follows:

• Split into sentences

First, the text is split into sentences. (More about this later.)

• Score the sentences

The sentences are scored according to what HTML element they appear in, and whether or not they are the first sentence in that element. The algorithm here is pretty simple: each element has a score. The first sentence in that element gets this score; the rest of the sentences get nothing.

• Sort the sentences by score

The sentences are stored in an array of hashes. Each hash corresponds to a sentence, and contains information about the text in the sentence, its length, the HTML element it appeared in, the score given to it, and its original order in the text.

    $summary[ scalar( @summary ) ] = {
        'text'          => $text,
        'length'        => length( $text ),
        'tag'           => $tag,
        'score'         => $score,
        'order'         => scalar( @summary ),
    };

The scores, as described above, are based on the HTML element that the sentences appear in. These scores are stored in a global hash:

    my %ELEMENT_SCORES = (
        'p'         => 100,
        'h1'        => 90,
        'h2'        => 80,
        'h3'        => 70,
    );

These scores were arrived at by empirical investigation; we have no theoretical justification for them.

• Truncate the list of sentences

Calculate how many sentences are needed to exceed the requested summary length.

• Sort the sentences by original order again

Having remembered the original sentence order in the text in the hash for that sentence, we can now re-sort the sentences in that order.

• Concatenate the sentences to create the summary

Spaces are added between the sentences, since whitespace was stripped when the sentences were split.

• Truncate the summary at the requested length

This last step assumes that if you want a summary of 200 characters, 201 characters are not acceptable . even if it means chopping the summary off mid-sentence. This is what we wanted in CS-Web. Maybe in other applications a less severe approach would be appropriate . it's easy to add more options to HTML::Summary, so let us know what you think.

Sentence Splitting

Now for the nitty gritty. The remainder of this article focuses on just one aspect of the HTML::Summary: splitting the element contents into sentences. Japanese character encodings were a particular problem for CS-Web; our approach is described in a sidebar: Truncating Japanese Text.

The task of splitting text into sentences seemed like a more general problem than its application to text summarization, so this is contained in a separate module: Text::Sentence. For the moment, this is distributed as part of the HTML::Summary package, but we would be interested to learn if there is any interest in using this module independently.

Text::Sentence is basically just a regex. It is has a non-object oriented interface that exports one function, split_sentences(), that takes the text to be split into sentences as an argument, and returns a list of the sentences.

  sub split_sentences
  {
      my $text = shift;

      return () unless $text;

The function first checks if there really is any text to split into sentences; if not, it just returns the empty string.

# $capital_letter is a character set; to account for locale, this

# includes all letters for which lc is different from that letter.

  my $capital_letter =  
      '[' . 
          join( '', 
            grep { lc( $_ ) ne ( $_ ) } 
             map { chr( $_ ) } ord( "A" ) .. ord( "\xff" )
          ) . 
      ']'
  ;

Although it would be more efficient to compute this regex component once at the package level, doing it in split_sentences() allows the user to change locales between calls.

The next few lines build up the components of the regex that split the text into sentences. The first of these components is the capital letter found at the start of a sentence. Instead of using the character class [A-Z] as you would normally, Text::Sentence accounts for locale-specific capital letters. For example, in French, a capital A acute (Á) won't be matched by [A-Z]. The method used in Text::Sentence makes use of the fact that lc is sensitive to locale settings, and returns a lowercase version of all capitalized characters. For more information on how Perl handles locales, see the perllocale documentation.

        @PUNCTUATION = ( '\.', '\!', '\?' );

The @PUNCTUATION array is a global variable in Text::Sentence containing any punctuation used to indicate the end of a sentence. The fact that it's a global means that you're encouraged to change it. For example, you might want to add locale specific punctuation for the Spanish . Ą':

    my $html_summarizer = new HTML::Summary LENGTH => 200;
    push( @HTML::Summary::PUNCTUATION, chr( 161 ) );

Back to split_sentences():

  # this needs to be alternation, not a character class, 
  # because of multibyte characters
  my $punctuation = '(?:' . join( '|', @PUNCTUATION ) . ')';

As mentioned above, one of our concerns was dealing with multibyte character encodings (see the sidebar on Truncating Japanese Text). Japanese punctuation characters may be more than one character long. For example, an exclamation point in the EUC Japanese encoding is "\xA1\xAA".

# return $text if there is no punctuation ... return 
$text unless $text =~ /$punctuation/;

If these isn't any end-of-sentence punctuation in the text, then we might as well return the text now.

 my $opt_start_quote = q/['"]?/;
 my $opt_close_quote = q/['"]?/;
 
 # these are distinguished because (eventually!) I would like to do 
 # locale stuff on quote  characters
 
 my $opt_start_bracket = q/[[({]?/; # }{
 my $opt_close_bracket = q/[\])}]?/;

Sentences sometimes have quotation marks or parentheses which come before the capital letter at the beginning, or after the full stop (period, question mark, or exclamation point) at the end. For example, the following sentence:

Larry said "let there be light!" (And there was.)

is two sentences; the first ends after the second double quote. However:

Larry said "let there be light!" (and there was).

is one sentence. Here is the regex in all its glory:


  my @sentences = $text =~ /
  (
                          # sentences start with ...
      $opt_start_quote    # an optional start quote
      $opt_start_bracket  # an optional start bracket
      $capital_letter     # a capital letter ...
      .+?                 # at least some (non-greedy) chars ...
      $punctuation        # ... followed by any one of !?.
      $opt_close_quote    # an optional close quote
      $opt_close_bracket  # and an optional close bracket
  )
  (?=                     # with lookahead that it is followed by ...
      (?:                      # either ...
          \s+                  # some whitespace ...
          $opt_start_quote     # an optional start quote
          $opt_start_bracket   # an optional start bracket
          $capital_letter      # an uppercase word character 
                               # (for locale sensitive matching)
      |                        # or ...
          \n\n                 # a couple (or more) of CRs
                               #(i.e. a new para)
      |                        # or ...
          \s*$                 # optional whitespace and by end of string
      )
  )
  /gxs
  ;
  return @sentences if @sentences;
  return ( $text );
}

This regex makes use of the lookahead feature of regular expressions. In this case, it allows us to specify that a sentence must not only start with a capital letter, and end in a full stop, but that there must be another capital letter which follows the full stop. The only exception to this is when the sentence is at the end of a paragraph.

The lookahead accounts for the whitespace between sentences, so it's not part of the matched patterns that end up in the @sentences array. That's why concatenating the sentences won't give you back the exact original text.

The main problem with trying to split text into sentences is that there are several uses for periods, such as abbreviations.

Dr. Livingstone, I presume.

This phrase counts as two sentences according to Text::Sentence -- the first sentence is three characters long. The performance of Text::Sentence could be improved by taking into account special cases like honorifics (Mr., Mrs., Dr.), common abbreviations (e.g., etc., i.e.), and so on. However, as with many natural language problems, this obeys the law of diminishing returns; a little bit of effort will do a decent 90% job, but that last 10% is pretty difficult. For our purposes the 90% is good enough.

Conclusion

We chose to use Perl for CS-Web because of the obvious benefits: the LWP modules for web programming, DBD/DBI, mod_perl, and so on. We found that Perl is also a useful tool for doing natural language work. Its text processing features, rapid development cycle, and ability to generate complex data structures on the fly make it particularly appropriate.

A lot of interesting work in natural language research involves analyzing corpus data; collecting statistics about language use over large databases of typical usage. The web is an obvious rich source of this type of data, and in view of this, it is a little surprising how few tools and modules appeared to be available in Perl for this field. Certainly, when we were working on the Text::Sentence regex, we posted something to a language processing mailing list, and there seemed to be quite a lot of interest in what we were doing, as well as extensive Perl expertise in that community. Hopefully natural language processing will become yet another nut for Perl to crack!

__END__

Tony Rose has a background in natural language processing and has published widely in the area of speech and language technology. He joined Canon Research Centre in November 1996 and is now manager of the Information Retrieval Department. Prior to this position he worked at Hewlett-Packard Laboratories on language modelling for speech recognition. He then completed a research fellowship at BT Labs on Internet agents for information retrieval. His research interests include natural language interfaces and information retrieval.

Ave Wrigley is webmaster at www.itn.co.uk. He has been developing web applications in perl for the last five years, and is the author of WWW::Sitemap and co-author of WWW::Robot and HTML::Summary. He can be contacted at Ave.Wrigley@itn.co.uk.

TABLE OF CONTENTS