
On-the-Fly Plots Made Easy

Lincoln D. Stein

Packages Described:

Gnuplot: http://www.geocities.com/SiliconValley/Foothills/6647/
ftp://ftp.gnuplot.vt.edu/pub/gnuplot/

Libgd: http://www.boutell.com/

With all the excitement over the things that we can make Perl do, we sometimes lose sight of the fact that Perl's greatest strength is the ease with which you can use it to glue independent programs together into a single powerful application. This toolkit philosophy, in which large applications are built from many small command-line tools, is the great simplifying principle of the Unix operating system, and one that Perl readily takes advantage of.

This column addresses a case in point. Say you are interested in seeing the hourly breakdown of accesses to your Web site in order to do capacity planning. When does traffic peak and when is it at a minimum? Say that you want to be able to view this data graphically as a bar chart, and that you'd like the chart to be generated on the fly from a CGI script. How would you go about doing this?

One approach would be to do all the work in Perl. Running as a CGI script, Perl can parse the server's access log file, tally up the hourly hits, generate the bar chart using the graphics primitives in the GD or Image::Magick modules, and output the plot as a GIF image. The script would be responsible for drawing the plot axes, calculating the width and position of the bars, creating the X and Y tics, and drawing the labels. Although the job is relatively straightforward, the program would likely take at least half a day to write and debug, and would certainly be several pages of code by the time you were finished.

But why reinvent the wheel? There are many plotting packages available for Unix and Windows systems, and most of these can be called as subprocesses from within Perl. One of the more ubiquitous plotting packages is Gnuplot, an open source package that comes preinstalled on many Linux systems, and is available as source and precompiled binaries for a variety of Unix and Microsoft Windows architectures. By taking advantage of this existing tool, we can write this CGI script in just a few minutes. The complete application comes to just 46 lines of code.


Figure 1: Gnuplot output

Using Gnuplot

Before we dive into the log-processing application itself, let's look at a simple use of Gnuplot from within Perl. Gnuplot was designed to draw complex mathematical functions and plot scientific data. It can be run interactively under the X Window System and Microsoft Windows, in which case plots are displayed in a graphics window, or run in batch mode from the command line, in which case the output graphics are written to standard output or a file. Gnuplot can generate a wide variety of graphics file formats, including GIF and PostScript.

Gnuplot has a command language that can be used to control the graphs it generates. In interactive mode you can type in commands and watch them take effect immediately. In batch mode, you feed Gnuplot commands from a file or standard input.

Let's look at an example. From the Gnuplot command line, here's how to graph the equation y=sin(x)/cos(x):

gnuplot> plot sin(x)/cos(x)

In a fraction of a second, Gnuplot displays the graph shown in Figure 1. It's almost as easy to call Gnuplot as a subprocess from within Perl. Just open up a pipe to Gnuplot, and send it the commands. For example, Listing 1 shows a 6-line program to plot sin(x)/cos(x) from within Perl and output the result in GIF format. The script begins by defining a GNUPLOT constant indicating the path to the Gnuplot executable. It then opens up a pipe to Gnuplot by calling Perl's open() function with a pipe symbol as the first character in the command's name. The program next sends two commands to Gnuplot. The first (line 4) changes the default terminal type to gif, telling Gnuplot to output the graph as a GIF format image. Next, the script sends the plot command as before. The pipe is now closed and the program exits. Note that you'll need Gnuplot version 3.7 or higher to get GIF support, plus Tom Boutell's libgd graphics library. See the Packages Described box to obtain the most up-to-date versions.
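The pipe technique can be sketched as follows. This is a minimal illustration rather than a copy of Listing 1: the Gnuplot path and the helper names gnuplot_script() and plot_to_gif() are assumptions, and the gif terminal is only available when Gnuplot was built with libgd.

```perl
use strict;
use warnings;

# Assumed location of the Gnuplot binary; adjust for your system.
use constant GNUPLOT => '/usr/bin/gnuplot';

# Build the text sent to Gnuplot: select GIF output, then plot.
sub gnuplot_script {
    my ($expression) = @_;
    return "set terminal gif\nplot $expression\n";
}

# Open a pipe to Gnuplot and send it the commands. The '|-' mode is the
# three-argument spelling of a leading pipe symbol in the command name.
# Gnuplot writes the GIF to standard output.
sub plot_to_gif {
    my ($expression) = @_;
    open(my $gp, '|-', GNUPLOT) or die "Can't start Gnuplot: $!";
    print $gp gnuplot_script($expression);
    close($gp) or die "Gnuplot failed (exit status $?)";
}

# Only call Gnuplot when an expression is given on the command line.
plot_to_gif($ARGV[0]) if @ARGV;
```

Saved under a hypothetical name such as plot.pl, it could be run as perl plot.pl 'sin(x)/cos(x)' > tan.gif.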

Gnuplot can graph experimental data as well as equations. The simplest format it accepts consists of a two-column table indicating the X and Y values of a data series. Comment lines beginning with the # symbol are ignored. Listing 2 shows a data table containing one day's worth of hourly tallies on my laboratory Web site. The first column is the hour (using the 24-hour clock), and the second column is the total number of hits. Plotting this data as a bar chart is a matter of setting the chart type by providing the set data style command with boxes, and then issuing the plot command with the data file name as its argument:

gnuplot> set data style boxes
gnuplot> plot 'hours.dat'

By fussing with various set commands, you can adjust many aspects of Gnuplot's display, including the graph and axis labels, the position of the key, the positioning and frequency of tics, and so forth. However, Gnuplot is oriented towards scientific rather than business applications, so the number of plot types is limited. For example, pie charts are not supported.
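As a rough sketch of what such set commands look like when generated from Perl, the helper below assembles a title, axis labels, and tic spacing. The function name chart_settings() and all of the label values are made up for illustration. The set data style line is the one used in this article; newer Gnuplot releases spell it set style data boxes.

```perl
use strict;
use warnings;

# Return Gnuplot commands that label a bar chart.
# All option values are supplied by the caller; none come from the article.
sub chart_settings {
    my (%opt) = @_;
    my @commands = (
        qq(set title "$opt{title}"),
        qq(set xlabel "$opt{xlabel}"),
        qq(set ylabel "$opt{ylabel}"),
        "set xtics $opt{tic_start},$opt{tic_step}",   # first tic, then increment
        'set data style boxes',   # newer Gnuplot: set style data boxes
    );
    return join("\n", @commands) . "\n";
}

print chart_settings(
    title     => 'Hits by hour',
    xlabel    => 'Hour',
    ylabel    => 'Hits',
    tic_start => 0,
    tic_step  => 2,
);
```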

What if you want the data table to be computed dynamically by an external program? Gnuplot provides the equivalent of Perl's open pipe syntax. If you prepend the < symbol to the filename passed to the plot command, Gnuplot treats the rest of the string as a command to execute and plots whatever that command writes to standard output. For example, if we had a Perl script named tally_log.pl whose job was to parse a Web access log and produce a tabular list similar to that in Listing 2, we could plot its output dynamically with this command:

gnuplot> plot '< tally_log.pl ~www/logs/access_log'

Parsing Log Files

Running Gnuplot is half of our CGI script. The other half is the part that parses the Web server's log file.

The standard or common log file format used by most Web servers was established years ago by the NCSA httpd and CERN servers. Each line of the access log records a single hit on your site and is subdivided into the seven fields described in Table 1. A typical entry looks like this one:

phage.cshl.org - - [07/May/1999:01:17:19 -0400] GET / HTTP/1.0 200 6118

To parse the various fields from within Perl, just match each line against the following regular expression:

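# Servers normally quote the request, so the quotes are optional here.
my $REGEX = qr/^(\S+) (\S+) (\S+) \[([^]]+)\] "?(\w+) (\S+).*"? (\d+) (\S+)/;
while (<>) {
   # Skip any line that doesn't look like a log entry.
   my ($host,$rfc931,$user,$date,$request,$URL,$status,$size)
              = m/$REGEX/ or next;
}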

You can then tally or otherwise manipulate the fields in any way you choose.
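For instance, here is a small sketch that turns one log line into a hash of named fields. The helper name parse_line() and the hash keys are my own choices for illustration. The pattern accepts the request with or without the double quotes that servers normally write around it.

```perl
use strict;
use warnings;

# Common log format: host, ident, user, [date], request, status, size.
my $REGEX = qr/^(\S+) (\S+) (\S+) \[([^]]+)\] "?(\w+) (\S+).*"? (\d+) (\S+)/;

# Parse one log line into a hash reference, or return nothing on a mismatch.
sub parse_line {
    my ($line) = @_;
    my @fields = $line =~ $REGEX or return;
    my %entry;
    @entry{qw(host rfc931 user date request url status size)} = @fields;
    return \%entry;
}

my $entry = parse_line(
    'phage.cshl.org - - [07/May/1999:01:17:19 -0400] "GET / HTTP/1.0" 200 6118'
);
print "$entry->{host} sent $entry->{request} $entry->{url}\n" if $entry;
```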


Figure 2: A bar chart produced by tally_hourly.pl.

Putting it All Together

The tally_hourly.pl script parses the Web server's access log, tallies the hourly hits, and passes the result to Gnuplot. It's intended to be called from an inline <img> tag like this:

<img src="/cgi-bin/tally_hourly.pl?file=access_log.1">

The script expects a single CGI parameter named file, which tells the script what access log file to open and parse. On my Web site, log files are rotated once a day, access_log becoming access_log.1, access_log.1 becoming access_log.2, and so forth. So access_log.1 always holds the complete record of yesterday's activities. You can put multiple days' graphs on the same page just by repeating the <img> tag with different file parameters.

A page created by tally_hourly.pl is shown in Figure 2. For the day plotted, accesses peaked at 11:00 in the morning, and then peaked again at about 5:00 in the afternoon. This pattern is typical of scientific sites, which are most heavily accessed during the work day. Recreational sites are more heavily used in the evenings.

Listing 3 shows the code for tally_hourly.pl. The main trick used here is that the script is actually two programs which execute independently of one another (see the diagram in Figure 3). The first program is a CGI script that processes the CGI file parameter and invokes Gnuplot. The second program is the log file parser. It is invoked by Gnuplot when Gnuplot runs its plot command. The script determines which context it's running in by looking at the @ARGV array. If @ARGV is empty, then the script is running in the CGI context. Otherwise, it's been invoked by Gnuplot in order to process the log file given by the first @ARGV argument.
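The two-programs-in-one idea boils down to a single test on @ARGV. The sketch below shows only that dispatch. The names run_mode(), run_as_cgi(), and run_as_parser() are placeholders of mine, and the two stubs merely describe what the real branches would do.

```perl
use strict;
use warnings;

# Decide which role this invocation plays, based on its arguments.
sub run_mode {
    my (@args) = @_;
    return @args ? 'parser' : 'cgi';
}

# Stubs standing in for the real work of each role.
sub run_as_cgi    { return "CGI mode: would check the file parameter and start Gnuplot\n" }
sub run_as_parser { my ($log) = @_; return "Parser mode: would tally $log for Gnuplot\n" }

if (run_mode(@ARGV) eq 'cgi') {
    print STDERR run_as_cgi();
} else {
    print STDERR run_as_parser($ARGV[0]);
}
```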


Figure 3: tally_hourly.pl is actually two programs:
a CGI script that calls Gnuplot, and a log file parser that is called by Gnuplot.

Turning to the listing, the script begins by activating Perl's taint check features with the -T switch. This is particularly important to do in this script because it will be shelling out to a subprocess using untrusted user-provided data (the value of the file parameter). The script then turns on strict syntax checking and imports various functions from the CGI and CGI::Carp libraries. Among the symbols imported from CGI is the -no_debug pragma, which disables CGI.pm's command-line debugging features. This is important to do; otherwise CGI.pm might enter command-line debugging mode when it is invoked by Gnuplot. We bring in the fatalsToBrowser() function from CGI::Carp, which automatically redirects fatal errors to the browser, helping track down any script failures.

Line 4 hard-codes the PATH environment variable. This is necessary in order for the script to pass Perl's taint checks.

Lines 5-6 define file paths for Gnuplot and the directory that contains the server's log files. These constants will need to be adjusted for your system. Line 7 puts standard output into unbuffered (autoflush) mode. This is necessary when calling out to subprocesses to ensure that the output produced by the script and the subprocess appear in the order you intend rather than the order determined by incompatible I/O buffering schemes.
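A setup block along these lines might look like the sketch below. The paths are assumptions for illustration, and the -T switch itself belongs on the script's #! line, which is left out here. Removing the extra shell-related variables is a precaution recommended in the perlsec manual page rather than something this article's listing is known to do.

```perl
use strict;
use warnings;

# Taint mode refuses to run subprocesses until PATH is set to a known value.
$ENV{PATH} = '/bin:/usr/bin';

# perlsec also suggests clearing these shell-related variables.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

# Assumed locations; adjust for your system.
use constant GNUPLOT => '/usr/bin/gnuplot';
use constant LOGS    => '/www/logs';

# Autoflush standard output so our output and Gnuplot's stay in order.
$| = 1;
```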

Line 11 examines the @ARGV array. If it is empty, then the script is running in the CGI environment. Otherwise, the script is running under Gnuplot. In the former case, the script recovers the log file name from the file parameter, untaints it using a pattern match, and prepends the log directory path to the untainted file name. The pattern match is set up to match any file name beginning with an alphabetic character and followed by zero or more alphanumeric characters, the hyphen, or the dot. This specifically excludes filenames containing shell metacharacters such as >, and relative path names like "..". The script tests that the file exists and is readable. If so, it calls a function named generate_gif() to create the bar chart (line 17).
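The untainting step might be sketched like this. The function name untaint_log_name() and the log directory are assumptions, and the pattern is my reading of the description above rather than the one in Listing 3. Note that \w also admits the underscore, which names like access_log need.

```perl
use strict;
use warnings;

use constant LOGS => '/www/logs';   # assumed log directory

# Return the full path for an acceptable log file name, or undef otherwise.
sub untaint_log_name {
    my ($name) = @_;
    return unless defined $name;
    # A letter first, then letters, digits, underscores, dots, or hyphens.
    # Slashes and shell metacharacters can never match.
    return unless $name =~ /^([a-zA-Z][\w.-]*)\z/;
    return LOGS . "/$1";            # $1 is untainted by the capture
}

for my $candidate ('access_log.1', '../etc/passwd', 'log>out') {
    my $path = untaint_log_name($candidate);
    print "$candidate: ", (defined $path ? $path : 'rejected'), "\n";
}
```

A real script would go on to check that the resulting path exists and is readable before handing it to Gnuplot.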

If there is a command-line argument in the @ARGV array, then the script knows it has been invoked by Gnuplot. The command-line argument contains the full path to the log file to process. The script calls generate_data() to tally the indicated log file and produce the tabular summary for Gnuplot's use (line 22).

Lines 24-34 contain the definition of generate_gif(). It emits an HTTP header with a content type field of image/gif. It then opens up a pipe to Gnuplot and sends it graphing commands. Most of the commands are constant boilerplate read from the script's __DATA__ section. I arrived at this set of commands by playing with Gnuplot interactively until the graph looked the way I wanted it. Then the subroutine sends Gnuplot the plot command, telling it to run a pipe constructed from the $0 variable, which holds the name of the currently running script, and the value of the file parameter. The plot command ends up looking something like this:

plot '< /www/cgi-bin/tally_hourly.pl /www/logs/access_log.1'

The subroutine closes the Gnuplot pipe and exits. Gnuplot sends its output to standard output, which is then forwarded to the waiting browser.
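A stripped-down sketch of that subroutine follows. It is not Listing 3: the argument list, the helper plot_command(), and the hand-written header (in place of CGI.pm's header function) are simplifications of mine. Nothing here calls Gnuplot unless generate_gif() is actually invoked.

```perl
use strict;
use warnings;

# Build the plot command that makes Gnuplot run a script as its data source.
sub plot_command {
    my ($script, $logfile) = @_;
    return "plot '< $script $logfile'\n";
}

# Emit the HTTP header, then let Gnuplot write the GIF to standard output.
# @settings holds any boilerplate "set ..." lines, each ending in a newline.
sub generate_gif {
    my ($gnuplot, $logfile, @settings) = @_;
    local $| = 1;                                   # header must go out first
    print "Content-type: image/gif\r\n\r\n";
    open(my $gp, '|-', $gnuplot) or die "Can't start Gnuplot: $!";
    print $gp "set terminal gif\n", @settings;
    print $gp plot_command($0, $logfile);           # $0 is this script's name
    close($gp) or die "Gnuplot failed (exit status $?)";
}

print plot_command('/www/cgi-bin/tally_hourly.pl', '/www/logs/access_log.1');
```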

Lines 35-46 define generate_data(), which is called on to process the log file and produce a summary in Gnuplot data format. It reads through the log file line by line using the <> operator and parses out the hour part of the request time. In this case, we don't need any of the other fields, so the more general regular expression that we looked at previously isn't needed. The parsed hour is added to a hash named %HITS, which keeps track of how many hits we've seen in each interval. When the subroutine reaches the end of the log file, it sorts the keys of %HITS numerically and prints out a two column table in Gnuplot format.
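The tallying logic can be sketched as below. To keep the example self-contained it reads from a filehandle instead of using the <> operator, and the names tally_hours() and hours_table() are mine. The three sample log lines are invented for the demonstration.

```perl
use strict;
use warnings;

# Count hits per hour from common-log-format lines read from $fh.
sub tally_hours {
    my ($fh) = @_;
    my %HITS;
    while (my $line = <$fh>) {
        # The hour is the first number after "[dd/Mon/yyyy:".
        next unless $line =~ m!\[\d+/\w+/\d+:(\d+):!;
        $HITS{$1 + 0}++;            # "+ 0" turns "01" into 1
    }
    return \%HITS;
}

# Render the tally as a two-column table, sorted numerically by hour.
sub hours_table {
    my ($hits) = @_;
    return join '',
        map { sprintf("%d\t%d\n", $_, $hits->{$_}) }
        sort { $a <=> $b } keys %$hits;
}

# Invented sample data, read through an in-memory filehandle.
my $sample = <<'END';
phage.cshl.org - - [07/May/1999:01:17:19 -0400] "GET / HTTP/1.0" 200 6118
phage.cshl.org - - [07/May/1999:01:42:03 -0400] "GET / HTTP/1.0" 200 6118
phage.cshl.org - - [07/May/1999:13:05:51 -0400] "GET / HTTP/1.0" 200 6118
END
open(my $fh, '<', \$sample) or die "Can't read sample: $!";
print hours_table(tally_hours($fh));
close($fh);
```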

Simple Things Made Easy

The tally_hourly.pl script is another example of how Perl makes simple things easy, and hard things possible. By using a preexisting tool rather than rolling our own, a script that might have been a bear to write became almost trivial.

With a little more work, you could adapt this script for other sorts of log processing tasks, such as plotting hits by day of the week, summing over domain names, or tallying up the bandwidth used by the Web server. By replacing Gnuplot with a charting package oriented towards business graphics, you can produce pie charts, stacked column charts and other displays with little additional effort.

__END__


Lincoln D. Stein is the author of CGI.pm.

