PREVIOUS  TABLE OF CONTENTS  NEXT 

Create Surreal HTML Pages with The Mangler

Lincoln D. Stein

If you've poked around the eg directory in the Perl distribution, you may have noticed a small program called travesty. This program takes any regular text file, processes it, and spews out a curious parody of the original. For example, here's a small part of what you get when you feed it this article:

Travesty achieves this by calling the Perl distribution's eg directory, you may have noticed a small program called travesty. This program takes any regular text file, processes it, and spews out a curious parody of the number of words to generate (lines 81 to 82).

Travesty::regurgitate() returns a parse tree in turn, calling ref() to determine whether the node is any of the tree by returning a value of 0 from the LWP modules, as well as back to Mangler's fill-out form.

Travesty's output is almost, but not quite, English. Reasonable phrases and sometimes whole sentences pop out, but the whole makes no sense at all. However, if you were to analyze the word frequency of the output, you'd find it identical to the original. Furthermore, if you were to count the frequency of word pairs, you'd find them the same as well. Travesty achieves this by using the original text to create a lookup table of word triples (A,B,C), in which C is indexed on the (A,B) pair. After creating this table, it spews out a parody of the text using an process known as a Markov chain: the program chooses a random (A,B) pair and uses the lookup table to pick a C. The new (B,C) pair is now used to look up the fourth word and this process continues ad infinitum.

his article presents the "The Mangler", a CGI script that runs any Web page on the Internet through the travesty program and returns the result. If you like, you can play with it at http://www.genome.wi.mit.edu/~lstein/mangler.cgi.

How It Works

You can see Mangler's entry page below. When the user connects, she's shown a page that prompts her to type in the URL for a web page with text. When she presses the "Mangle" button, the script extracts the text from that page, slices and dices it with the travesty algorithm, and displays the result, shown on the next page.

The Mangler uses routines from the LWP modules, as well as from CGI.pm. Both of these libraries are available from a CPAN archive near you.

Prompting the User

The Mangler uses CGI.pm to parse the CGI parameters and create the fill-out form. We pull in CGI.pm on line 6 and import both the standard and HTML3-specific subroutines. On line 11 we look for a parameter named "mangle." If defined, we call the mangle() subroutine (line 16). Otherwise, we call prompt_for_url(). As an aside, line 11 shows a technique for initializing field values in a fill-out form. Only if the request method is a POST resulting from the user pressing the "Mangle" button, do we actually do the work. Otherwise, if the request method is a GET, we ignore it and let CGI.pm's "sticky" behavior initialize the text field automatically. This allows you to create a default URL for Mangler by creating a link to it like this one:

<A HREF="/cgi-bin/mangler?mangle=http://www.microsoft.com/">Mangle Uncle Bill</A>

The prompt_for_url() routine is defined in lines 92 through 110. It follows the form that should be familiar to readers of my previous columns. Using CGI.pm's fill-out form and HTML shortcuts, we create a form containing a single text field labeled "mangle" and a submit button.

Fetching the Document

The first half of the mangle() subroutine (lines 25-50) does the work of fetching the remote document. As described in last issue's column, we use the LWP::UserAgent library to create an HTTP request and to retrieve the document across the net. Several things may go wrong at this point. For example, the user may have typed in an invalid URL, or the remote server may be down. On line 31 we check the success status of the transfer. If the transfer fails, the subroutine prints out the nature of the error using LWP's statusMessage() subroutine and returns. When the script sees that the subroutine has returned a false value, it regenerates the fill-out form by invoking prompt_for_url() again.

Next, we extract the retrieved document's MIME type from its Content-type header field. We get the field on line 42 by making a call to the LWP::Response header() method. We can only process HTML files, so if the type turns out not to be "text/html" we print an error message and again return false.

If all has gone well so far, we print out a small preamble before the mangled document itself (lines 52-59). The preamble creates a title for the page, a descriptive header, and links to the original document location and to Mangler's fill-out form. One interesting question: How do we ensure that the document's relative URLs and in-line images work properly? We set the document's BASE attribute to the URL of the unmodified document by passing -xbase to the start_html() method in CGI.pm.

Run the Travesty Algorithm

This is the most interesting part of the program. If we were to pipe the retrieved HTML through the travesty generator, it would scramble the tags with the text, creating an illegible mess. We want to mangle the text of the file but leave its HTML structure, including tags and in-line images, intact.

To do this, we use the HTML manipulation routines defined in LWP. On line 61, we call parse_html(), a routine defined in HTML::Parse. This parses the HTML document and returns a parse tree object, which we store in the scalar $parse_tree.

On line 62 we make the first of two calls to the parse tree's traverse() method. This method performs a depth-first traversal of the parse tree, calling the subroutine of our choosing for each element of the tree. In this case we pass it a reference to our swallow() subroutine (lines 68-72). swallow() examines each node in turn and extracts the ones that contain straight text, passing them to the travesty algorithm. There are two types of node to worry about: those that are branches in the tree (tag pairs surrounding content), and those that are leaves (plain text). We can distinguish between branches and leaves by calling Perl's ref() function on the node. If the node is a reference, it's a branch and we return immediately. Otherwise we know that the node is a leaf. We pass its text to the subroutine Travesty::swallow() which breaks up the string into an array of words using split() and adds them to the travesty algorithm's lookup table.

The travesty algorithm itself is defined in the last sixty lines of The Mangler. The code here is a slight modification of the original code in Perl's eg directory and I won't go into the details here. It's worth studying, particularly if you're interested in entering the Obfuscated Perl contest. [Next issue. -ed]

Print the Mangled Document

The last task is to print out the mangled document. In line 61 we make the second call to traverse(), this time passing it a reference to the regurgitate() subroutine (lines 74-87). As before, the subroutine examines each node of the parse tree in turn, calling ref() to determine whether the node is a leaf or a branch. If the node is a branch corresponding to any of the tags <HTML>, <HEAD>, or <BODY> we skip it completely - we've already begun to create the HTML document and we don't need to repeat these sections. Similarly, we skip the entire contents of the HTML head section, by asking the parse tree's isInside() method (line 78) whether the node lies within a <HEAD> tag. If it does, we abort the traversal of this part of the tree by having regurgitate() return 0. Otherwise, we print out the tag, using the node's starttag() and endtag() methods to produce the appropriate opening and closing tags.

Whenever we encounter a leaf node containing text, we pass the number of desired words we'd like ($words) to Travesty::regurgitate() (lines 83 to 84). It returns a series of scrambled words, which we simply print out. That's it!

_ _END_ _


PREVIOUS  TABLE OF CONTENTS  NEXT