HTML Hacking with Regular Expressions - The Perl Journal, Spring 1996

Tom Christiansen

These days it sometimes seems as though you can't even glance at a billboard or a TV commercial without seeing a URL. When a programmer wants to create, update, or analyze an HTML document in an automated fashion, Perl is the obvious tool they turn to. And no small wonder: nothing else comes close to Perl's utility and functionality for text munging. Modern releases of Perl offer significant improvements in regexp handling over initial releases: both stingy and greedy matching, look-ahead assertions and negations, national character support, and fine-grain control of multiline matching all give added power, while an æsthetically pleasing (or at least, not unpalatable) free-format layout renders what would have otherwise been write-only gobbledygook into something that's easy to read, write, and maintain. But regular expressions can still be daunting to beginner and expert alike, so here are a few pointers about how to construct patterns for handling some of the more common HTML-related issues.

Manipulating Links

...Finding Links

Probably the task most frequently attempted on an HTML document is extracting all the links. To do this in all but the most esoteric cases, the following moderately elaborate regular expression will suffice.

#!/usr/bin/perl -n -00
while ( /<\s*A\s+HREF\s*=\s*(["'])(.*?)\1.*?>/gi ) {
    	print "$2\n";
}

Now, that's getting a bit hard to read. To fix this, just add the /x flag to your match and then indent the code however you like with an eye toward legibility. If you're running Perl 5.002 or better, you might also add comments to your pattern to give a blow-by-blow description of what you're doing. (Actually, sometimes you could do so in 5.000 and 5.001, but there were some compiler bugs that could make the comments confuse Perl.) By processing the input stream a paragraph at a time, you let the match spread over the whole paragraph; this is because sometimes there are newlines within tags, although seldom duplicate adjacent newlines. We use Perl's stingy matching feature of .*? to avoid .* expanding to more than we'd like.

#!/usr/bin/perl -n00
while ( m{          # match repeatedly with /g
   < \s* A          # this is an anchor
        \s+ HREF    # a link spec
        \s* = \s*   # here comes the link
        ( ["'] )    # either quote, saved in $1
                    # and \1
       ( .*? )      # the whole link, saved in $2
         \1         # the original $1 quote
       .*? >        # the rest of the tag 
         }xsgi)     # /x for expanded patterns
                    # /s so . can match \n
                    # /g to get multiple hits 
                    # in one paragraph
                    # /i for case insensitivity
                    # on A and HREF
{
      print "$2\n";
}

...Changing Links

An activity related to link extraction is altering links. Perhaps the host name or path has changed. Perhaps links that used to read

   <A HREF= "http://foo.com/somewhere/file">

should now read

   <A HREF= "http://www.foo.com/elsewhere/file">

instead. You'd like to perform the substitution on all the links, but nowhere else in the document. If you don't mind editing the whole file (and not just links), then a very simple substitution would suffice:

perl -pi.bak -e 's,http://foo\.com/somewhere/,http://www.foo.com/elsewhere/,g'

However, if you really care about modifying only legitimate links, you'll have to combine the last two programs. We'll use the -i flag to perform an in-place edit of the file, creating a backup with a ".bak" extension:

#!/usr/bin/perl -p -i.bak -00
s[ 	 
    	(
        < \s* A 
          \s+ HREF 
         	 \s* = \s* 
           	 ( ["'] ) 
   	 )
	    http://foo\.com/somewhere/ 
   	 (
       ( .*? ) 
        		 \2 
     .*? > 
   	 )
 ][${1}http://www.foo.com/elsewhere/$2]xsgi;

This isn't perfect, though, because it matches both "SOMEWHERE" and "somewhere". It also has some difficulties with HTML comments.

But see below for a comment stripping program.

...Plaintext to HTML

What if you have a text file that contains URL references, and you'd like to fix them up to point to the proper place? You need a way to detect the URLs and highlight them properly. That way something like

http://host.com

becomes

A problem: what to do if the URL has trailing punctuation? Consider something like ftp://host/path.file. Is that last dot supposed to be in the URL? We can probably just assume that a trailing dot doesn't count, but even so, most scanners seem to get this wrong. Here's a different approach. We'll store pieces of the pattern in variables for easier reading. (But we had better remember the /o, or else the pattern will run very slowly.) To get around the problem of trailing punctuation that might or might not be part of the URL, we'll give it our best guess by using one of Perl's look-ahead pattern assertions: (?= stuff ).

#!/usr/bin/perl 
# urlify 

require 5.002; 
# well, or 5.000 if you strip the comments
 
$urls = '(' . join ('|', qw{
                            http
                            telnet
                            gopher
                            file
                            wais
                            ftp
                           } 
                   ) 
      . ')';
 
$ltrs = '\w';
$gunk = '/#~:.?+=&%@!\-';
$punc = '.:?\-';
$any = "${ltrs}${gunk}${punc}";
 
while (<>) {
# use this if early-ish perl5 (pre 5.002)
# s{\b(${urls}:[$any]+?)(?=[$punc]*[^$any]|\Z)}
#                   {<A HREF="$1">$1</A>}goi;
s{
   \b           # start at word boundary
   (            # begin $1 
     $urls :    #  need resource and a colon
     [$any] +?  #  followed by one or more
                #  of any valid character, but
                #  be conservative and take
                #  only what you need to....
   )            # end $1 
   (?=          # a look-ahead, 
                #  non-consumptive assertion
       [$punc]* # either 0 or more punctuation
       [^$any]  #  followed by a non-url char
     |          # or else
        $       # the end of the string
   )
 }{<A HREF="$1">$1</A>}igox;
 print;
}

In this code, we guessed at the resource types (gopher, html, etc). It's probably safer just to put in something like \w+ instead, just in case new resource types someday appear which you'd like to handle.

Extracting Titles and Headers

Another common HTML-related task is extracting the title and/or headers from an HTML document. The first try might look something like

perl -lne 'print $1 if m:<TITLE>(.*)</TITLE>:'

This says: print out anything between a <TITLE> and its closing </TITLE> tag. When you run it on a simple test case, it works just fine. But then you try to run it over your whole doc tree, and you find that it misformats or entirely misses some of your HTML files. The first gotcha is that HTML tags aren't case sensitive, so you'd need a /i on the pattern match. The next one is that HTML tag contents can extend across line boundaries.

Enabling multiline matching isn't enough to fix this: you also have to read in a multiline record. Setting the input record separator variable to the empty string (orundef) takes care of this, but you also have to use /s so that the pattern is treated as a single line and allows dots to match newlines.

From the command line, that would be:

perl -00 -lne 'print $1 if m:<TITLE>(.*)</TITLE>:si'

or from a script:

#!/usr/bin/perl -00 -ln
print $1 if m:<TITLE>(.*)</TITLE>:si;

or with more of an awk mindset:


#!/usr/bin/perl -n
BEGIN { ($/, $›) = ("", "\n") }
print $1 if m:<TITLE>(.*)</TITLE>:si;

or for increased readability:

#!/usr/bin/perl 
use English;
$RS = '';

while ($paragraph = <ARGV>) {
    	if ( $paragraph =~ m:<TITLE>(.*)</TITLE>:si ) {
        print "$1\n";
    	} 
}

This all works fine until you come across the odd document with blank lines in the title, or extra fields in the <TITLE TAG="stuff"> line, or even more than one title. So you end up having to embellish your pattern until it becomes increasingly hard to understand and maintain for those who come after you.

Blech.

This brings us to the /x flag on pattern matching. By allowing embedded white space (and even comments as of Perl 5.002), you can tremendously improve legibility and thus maintainability. Here's a full-blown solution to printing out all the titles in all the files on the command line, or STDIN if none are given.

#!/usr/bin/perl -w

require 5.002;	 
# or 5.001 if you remove the comments!

use strict; 
undef $/;		
@ARGV = ('-') unless @ARGV;
my($title, $filename);
while ($filename = shift) { 
    	unless (open(HTML, $filename)) {
   	     warn "can't open $filename: $!";
       	 next;
    	}
    	my $html = <HTML>;
    	my $count = 0;
    	while ( $html =~ m{			 
        	 < \s* TITLE .*? >	        # begin tag
          	 \s* (.*?) \s* 	                    # contents
        	 < \s* / \s* TITLE .*? >	    # end tag
               			}gsix ) {
       if ($count++) {
         warn "$filename has $count titles!\n";
      	 } 
      	 ($title = $1 || "<UNTITLED>") =~ s/\s+/ /g;
      	 write;
    	} 
		}

format STDOUT =
@<<<<<<<<<<<<<<<<<<<<<<< ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$filename,               $title
                         ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
~~                       $title
.

Related to the task of printing out the title (or titles) in a document is dumping out the HTML outline:<H1>, <H2>, and so on. The approach is similar to the title-extraction program above. We'll assume that $html contains the entire document.

while ( $html =~ m{ 
            < \s* H (\d) \b .*? > 
              \s*     (.*?) \s* 
            < \s* / \s* H \1 \b .*? > 
       }gsix ) 
  {
      my($level, $contents) = ($1, $2);
      for ($contents) {
          s/<.*?>//g;            # any extra tags 
          s/\s+/ /g;           	 # newlines and tabs
      } 
      print "$level.", " " x $level, $contents, "\n";
  }

Reducing to Plaintext

Stripping
Stripping <TAGS>
Converting &entities; into what they should be.

This is complicated by the horrible specification for HTML comments: they can have embedded tags in them. The HTML specification inherits SGML's bizarre notion of comments, which can confuse even the most careful program that's trying to deal with them. For example:

<!DOCTYPE  HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"
       	 -- This is an annoying comment > --
>

Notice how that unquoted closing angle bracket doesn't actually end the tag? That's because it's in a comment. This is really annoying! Here's what makes up an HTML comment:

Comments begin with a "<!" and are followed by 0 or more comments; after that, each comment starts with a "--" and includes all text up to and including the next occurrence of "--". They may have trailing while space (albeit not leading white space). You can even have non-comment text up to that final ">" character.

Keeping this all straight requires more than a little care. Perhaps such a thing would be more easily coded using a tool like lex, but Perl's regexps are up to it if you give them a little help, with the /e flag.

Here's a complete program to reduce an HTML doc to plaintext in three easy subsitutions:

#!/usr/bin/perl -p0777
# striphtml ("striff tummel")

# how to strip out html comments and
# tags and transform entities in just
# three - count 'em three - 
# substitutions; sed and awk eat your 
# heart out. :-)

# as always, translations from this
# nacri rendition into more
# characteristically marine,
# herpetoid, titillative, or 
# indonesian idioms are welcome for
# the furthering of comparative
# cyberlinguistic studies.

require 5.001; 
# for nifty embedded regexp comments

# first we'll shoot all the 
# <!-- comments -->

s{ <!     # comments begin with '<!'
          # followed by 0 or more
          # comments;
   (.*?)  # this eats up comments
          # in non random places
   (      # not supposed to have any 
          # whitespace here
          # just a quick start:
   --     # each comment starts with
          # a '--'
     .*?  # and includes all text up
          # to and including the
   --     # next occurrence.  
     \s*  # and may have trailing
          # whitespace (but not
          # leading whitespace)
   )+     # repetire ad libitum
   (.*?)  # trailing non comment text
  >       # up to a '>'
}{
  if ($1 || $3) { # this silliness for 
                  # embedded comments in tags
     "<!$1 $3>";
  } 
}gsex;       # mutate into nada, nothing, 
             # and niente
 

# next we'll remove all the <tags>

s{ <            # opening angle bracket
                #
   (?:          # Non-backreffing grouping
                #                        paren
       [^>'"] * # 0 or more things that are 
                #       neither > nor ' nor "
         |      # or else
        ".*?"   # a section between 
                #  double quotes (stingy match)
         |      # or else
        '.*?'   # a section between 
                #  single quotes (stingy match)
   )+           # repetire ad libitum
                # hm...are null tags (<>) 
                #                   legal?
  >             # closing angle bracket
}{}gsx;         # mutate into nada, nothing, 
                # and niente


# finally we'll translate all &valid; HTML 2.0 
# entities
 
s{ (
    &           # an entity starts with a 
                # semicolon
    ( 
      \x23\d+   # and is either a pound 
                # (# == hex 23) and numbers
       |        # or else
      \w+       # has alphanumunders...
    ) 
   ;?           # a semicolon terminates, 
                # as does anything else
  )
} {
    $entity{$2} # if it's a known entity, 
                # use that. 
        ||      # But otherwise
        $1      # leave what we'd found. 
}gex;           # execute replacement - that's 
                # code not a string


# but wait! load up the %entity mappings
# enwrapped in a BEGIN so that we only 
# execute once since we're in a -p "loop". 
# awk is kinda nice after all.
BEGIN {

 %entity = (
    lt     => '<', 
    gt     => '>', 
    amp    => '&', 
    quot   => '"',     # "  (vertical double quote) 
    nbsp   => chr 160, #    (space)
    iexcl  => chr 161, # ¡
    cent   => chr 162, # ¢
    pound  => chr 163, # £ 
    curren => chr 164, # ¤
    yen    => chr 165, # ¥
    brvbar => chr 166, # ¦  (broken vertical bar)
    sect   => chr 167, # §
    uml    => chr 168, # ¨  (umlaut, or dieresis)
    copy   => chr 169, # ©
    ordf   => chr 170, # ª  (feminine ordinal)
    laquo  => chr 171, # «
    not    => chr 172, # ¬ 
    shy    => chr 173, #   (soft hyphen)
    reg    => chr 174, # ®
    macr   => chr 175, # ¯
    deg    => chr 176, # °
    plusmn => chr 177, # ±
    sup2   => chr 178, # ²  (superscript two)
    sup3   => chr 179, # ³  (superscript three)
    acute  => chr 180, # ´  (acute accent)
    micro  => chr 181, # µ  (micro sign)
    para   => chr 182, # ¶  (pilcrow)
    middot => chr 183, # ·
    cedil  => chr 184, # ¸  (cedilla)
    sup1   => chr 185, # ¹  (superscript one)
    ordm   => chr 186, # º  (masculine ordinal)
    raquo  => chr 187, # » 
    frac14 => chr 188, # ¼  (one-quarter)
    frac12 => chr 189, # ½  (one-half)
    frac34 => chr 190, # ¾  (three-quarters)
    iquest => chr 191, # ¿
    Agrave => chr 192, # À
    Aacute => chr 193, # Á
    Acirc  => chr 194, # Â
    Atilde => chr 195, # Ã
    Auml   => chr 196, # Ä
    Aring  => chr 197, # Å
    AElig  => chr 198, # Æ
    Ccedil => chr 199, # Ç
    Egrave => chr 200, # È
    Eacute => chr 201, # É
    Ecirc  => chr 202, # Ê
    Euml   => chr 203, # Ë
    Igrave => chr 204, # Ì
    Iacute => chr 205, # Í
    Icirc  => chr 206, # Î
    Iuml   => chr 207, # Ï
    ETH    => chr 208, # Ð  (capital Eth, Icelandic)
    Ntilde => chr 209, # Ñ
    Ograve => chr 210, # Ò
    Oacute => chr 211, # Ó
    Ocirc  => chr 212, # Ô
    Otilde => chr 213, # Õ 
    Ouml   => chr 214, # Ö
    times  => chr 215, # ×
    Oslash => chr 216, # Ø
    Ugrave => chr 217, # Ù
    Uacute => chr 218, # Ú
    Ucirc  => chr 219, # Û
    Uuml   => chr 220, # Ü
    Yacute => chr 221, # Ý  (capital Y, acute accent)
    THORN  => chr 222, #Þ   (capital THORN, Icelandic)
    szlig  => chr 223, # ß
    agrave => chr 224, # à
    aacute => chr 225, # á
    acirc  => chr 226, # â
    atilde => chr 227, # ã
    auml   => chr 228, # ä
    aring  => chr 229, # å
    aelig  => chr 230, # æ
    ccedil => chr 231, # ç
    egrave => chr 232, # è
    eacute => chr 233, # é
    ecirc  => chr 234, # ê
    euml   => chr 235, # ë
    igrave => chr 236, # ì
    iacute => chr 237, # í
    icirc  => chr 238, # î
    iuml   => chr 239, # ï
    eth    => chr 240, # ð  (small eth, Icelandic)
    ntilde => chr 241, # ñ
    ograve => chr 242, # ò
    oacute => chr 243, # ó
    ocirc  => chr 244, # ô
    otilde => chr 245, # õ
    ouml   => chr 246, # ö
    divide => chr 247, # ÷ 
    oslash => chr 248, # ø
    ugrave => chr 249, # ù
    uacute => chr 250, # ú
    ucirc  => chr 251, # û
    uuml   => chr 252, # ü
    yacute => chr 253, # ý  (small y, acute)
    thorn  => chr 254, # þ  (small thorn, Icelandic)
    yuml   => chr 255, # ÿ
);


# now fill in all the numbers to match
# themselves
 
    foreach $chr ( 0 .. 255 ) { 
        $entity{ '#' . $chr } = chr $chr;
    }
}

Don't Reinvent the Wheel

By now, you've probably decided two things: first, that Perl's regular expressions can handle nearly any text-munging task you'd ever want to do to an HTML file, and second, that to catch all the fringe cases you have to be extremely careful and not a little clever. Fortunately, code has already been written and tested to perform all of the most common tasks, including parsing HTML files, decoding URLs, escaping special characters, testing remote HTTP references, and much, much more. It's all included in the libwww library, a set of Perl modules written by Martijn Koster and Gisle Aas. With the functions in these modules, many of the tasks above can be automated with solid code hidden away in a module. And not only can you, for example, extract all the links, you can actually make sure that they point to valid documents. For the latest version of this library, check your nearest CPAN site.

__END__

TABLE OF CONTENTS