The man(1) of descent - The Perl Journal, Winter 1998

Damian Conway

Packages Used

Perl 5.004 or later CPAN
Parse::RecDescent CPAN

Being a scholarly Treatife on the Myfterious Origins and diverfe Ufes of that Module known as Parfe::RecDescent

So who cares about parsing anyway?

Er, well, humans do. Our brains seem to be hard-wired for a syntactic view of the world, and we strive (often unreasonably) to find or impose grammatical structures in our lives. Homo Sapiens is a species evolved for language (or possibly, by it). We are compulsive and incessant parsers: of written text, of spoken words, of our children's faces, of our dogs' tails, of our politicians' body language,(It is said that an infallible non-verbal cue that a politician is lying is touching of the nose. Or covering the mouth. Or adjusting shirt cuffs. Or smiling without showing any teeth. Or looking down. Or pointing. Or fidgeting. Or breathing.) of the simple grammar of traffic lights, and the complex syntax of our own internal aches and pains. You're parsing right now: shapes into letters, letters into words, words into sentences, sentences into messages, messages into (dis)belief!

So it's not surprising that programmers (many of whom were once human) should be concerned with parsing too. If you use any of the modules in the Pod::, Date::, HTML::, CGI::, LWP::, or Getopt:: hierarchies, or the Expect module, or TeX::Hyphen, or Text::Refer, or ConfigReader, or PGP, or Term::ReadLine, or CPAN, or Mail::Tools, or one of the database interface modules, or even if you just read in a line at a time and match it against some regular expressions, then you're parsing. Sometimes in the privacy and comfort of your own home.

Each of those modules contains a chunk of custom-made, carefully-tuned, walnut-veneered code, which takes ugly raw data and sculpts it into finely-chiseled information you can actually use.

Of course, if the CPAN doesn't supply a parsing system for the particular ugly raw data you need to process, then you're going to have to roll your own. That's what parser generators like Parse::RecDescent are for.

Anti-diff-erentiation

To use such tools, you first need to know something about formal grammars. Now just relax! This next bit won't hurt.

Suppose, for example, you want to "reverse" the output of a diff (i.e. as if the two files had been compared in the opposite order). Here's a typical diff output:

17,18d16
< (who writes under the
< pseudonym "Omniscient Trash")
45,46c43,44
< soon every Tom, Dick or Harry
< will be writing his own Perl book
---
> soon every Tom, Randal and Larry
> will be writing their own Perl book
69a68,69
> Copyright (c) 1998, The Perl Journal.
> All rights reserved.

Each chunk encodes a single transformation of the original file. For example, the first three lines mean "at lines 17 to 18 of the original file perform a 'Delete' operation, leaving us at line 16 in the second file." It then shows the two lines to be removed, each prefixed by a less-than sign.

The next six lines mean "at lines 45 to 46 of the first file do a 'Change' operation, producing lines 43 to 44 in the second file." It then shows the lines to be replaced (prefixed by less-than signs) and the replacement lines (with greater-than signs).

The last three lines mean "at line 69 in the first file do an 'Append' operation, to produce lines 68 to 69 in the second file." The lines to be added are then listed, prefixed by greater-than signs.

So a diff output consists of a series of operations, each of which in turn consists of a command followed by context information. Commands consist of a single letter sandwiched between numbers that represent line ranges in each file. Context information consists of one or more lines of left context, right context, or both, where each context line consists of a line of text preceded by an arrow.

SAY IT GRAMMATICALLY

Phew! That's a lot of structure for such a small piece of text. But it's useful structure.

Let's take this English and turn it into a formal grammar that conveys the same information more precisely (and in a more compact and readable format):

DiffOutput     -> Command(s) EOF

Command        -> LineNum A_CHAR LineRange 
                     RightLine(s)
                   | LineRange D_CHAR LineNum
                     LeftLine(s)
                   | LineRange C_CHAR LineRange
                     LeftLine(s)
                     SEPARATOR
                     RightLine(s)
LineNum        -> DIGITS

LineRange      -> LineNum COMMA LineNum
                   | LineNum

LeftLine       -> LEFT_BRACKET TEXTLINE

RightLine      -> RIGHT_BRACKET TEXTLINE

The way to read a formal grammar is to replace each right arrow (->) with the words "...consists of..." and each vertical bar (|) with the words "...or possibly...". The parenthesized (s) means the same as in English, namely, "...one or more times". So, for example, the grammar above says that a LineRange consists of either a LineNum subrule, then a COMMA, and finally another LineNum, or possibly just a single LineNum.

Each of the six chunks of the above grammar specifies a separate rule. Each identifier on the left of an arrow is the rule's name and each list of items to the right of an arrow (or a vertical bar) specifies the sequence of things that together comprise that rule. Such sequences are called productions. The vertical bars separate alternative productions, any one of which is an acceptable set of components for the rule as a whole.

Items in a production may be either the name of some other rule (such items are called non-terminals or subrules), or the name of something which represents some actual text (such items are called terminals). In the above grammar, the nonterminals are in mixed-case and the terminals in upper-case.

The use of named terminals implies that something else - usually, a specially written subroutine called a lexer - is going to preprocess the original input text and transform it into a sequence of labelled strings (called tokens). It is the labels on those tokens that will (we hope) match the terminals. More on lexers and lexing later.

REVERSING A diff

If you gave the full grammar to a parser generator, you'd get back a piece of code able to recognize any text conforming to the grammar's rules and, more importantly, identify each part of the text (17,18 is a LineRange, d is a D_CHAR, 16 is a LineNum, and so on) Of course, recognizing the text is only half the battle. Our goal is to reverse it.

Fortunately, reversing a diff output is easy. You just change every 'Append' command to a 'Delete', and vice versa, and then swap the order of line numbers and the contexts associated with every command (including those belonging to 'Change' commands). The contexts also have their leading arrows reversed.

Doing that with the example text above you get:

16a17,18
> (who writes under the
> pseudonym "Omniscient Trash")
43,44c45,46
< soon every Tom, Randal and Larry
< will be writing their own Perl book
---
> soon every Tom, Dick or Harry
> will be writing his own Perl book
68,69d69
< Copyright (c) 1998, The Perl Journal.
< All rights reserved

In order for the above grammar to automatically reverse diff output, it needs to know what to do whenever it recognizes something reversible (i.e. a complete command). You can tell it what to do by embedding blocks of code, which the parser then automatically calls whenever it successfully recognizes the corresponding production.

Adding the equivalent Perl actions into the above grammar:

DiffOutput       -> Command(s) EOF

Command     -> LineNum A_CHAR LineRange
                   RightLine(s)
                    { print "$item3 d $item1\n";
                      $item4 =~ s/^>/</gm;
                      print $item[4]; }
                | LineRange D_CHAR LineNum
                   LeftLine(s)
                    { print "$item3 a $item1\n";
                      $item4 =~ s/^</>/gm;
                      print $item[4]; }
                | LineRange C_CHAR LineRange
                   LeftLine(s) SEPARATOR RightLine(s)
                    { print "$item3 c $item1\n";
                      $item4 =~ s/^</>/gm;
                      $item6 =~ s/^>/</gm;
                      print $item6, $item5, $item4; }

LineNum      -> DIGITS

LineRange    -> LineNum COMMA LineNum
                    | LineNum

LeftLine     -> LEFT_BRACKET TEXTLINE

RightLine    -> RIGHT_BRACKET TEXTLINE

Note that the actions need access to the various bits of data which have been recognized, in order to reprint them. In this example, they access the data by referring to the lexically scoped variables $item1, $item2, $item3, and so on. These variables are automagically assigned the sub-strings of the original text that were matched by each item in the current production (just as $1, $2, $3, etc. automatically receive the values of the last matched parentheses of a regular expression). So the production:

Command  -> LineNum A_CHAR LineRange
               RightLine(s)
               { print "$item3 d $item1\n";
                 $item4 =~ s/^>/</gm;
                 print $item[4]; }

has now been told that if a Command rule matches an 'Append' command, then it should print out the text matching the LineRange (item 3 of the production), then a 'd', then the original text which matched the LineNum (item 1). Then it should reverse the arrows on the context lines (item 4) and print those out too.

From Grammar to Reality

Having specified the grammar, and its behavior when various productions are recognized, all you need to do is build an actual parser that implements the grammar. Of course, building the parser is the tedious and difficult bit. And that's why there are automated parser construction tools like Parse::RecDescent, and its distant cousins yacc, perl-byacc, PCCTS, Parse::Yapp, and libparse.

MANY PATHS UP THE MOUNTAIN

When it comes to such tools, the lazy programmer is spoilt for choice. There are dozens of freely-available packages for generating parsers in a variety of languages (but mostly in C, C++, Java, and Perl).

Almost all of the parser generators belong to one of two families-top-down or bottom-up-whose names describe the way their parsers go about comparing a text to a grammar.

BOTTOMS UP...

Bottom-up parsers ( Better known as "LR" parsers: "scan Left, expanding Rightmost subrule(in reverse)". ) are usually implemented as a series of states, encoded in a lookup table. The parser moves from state to state by examining the next available token in the text, and then consulting the table to see what to do next. The choices are: perform an action, change to a new state, or give up in disgust.

A bottom-up parse succeeds if all the text is processed and the parser is left in a predefined "successful" state. Parsing can fail at any time if the parser sees a token for which there is no suitable next state, and is therefore forced to give up.

In effect, a bottom-up parser converts the original grammar into a maze, which the text must thread its way through, state-to-state, one token at a time. If the trail of tokens leads to a dead end, then the text didn't conform to the rules of the original grammar and is rejected. In this metaphor, any actions that were embedded in the grammar are scattered throughout the maze and are executed whenever they are encountered.

The bottom-up "maze" for the diff-reversing grammar above looks like Figure 1.

Figure 1: bottom-up maze

...WITH THE TOP DOWN

Top-down parsers work quite differently.(** a.k.a. "LL" parsers: "scan Left, expanding Leftmost subrule".) Instead of working token-to-token through a flattened-out version of the grammar rules, top-down parsers start at the highest level and attempt to match the most general rule first.

Thus a top-down parser would start by attempting to match the entire DiffOutput rule, and immediately realize that to do so it must match one or more instances of the Command rule. So it would attempt to match a Command rule, and immediately realize that to do so it must match one of the three productions in the Command rule. So it would try to match the first production, and realize that it must first match a LineNum, and so on.

It would keep moving ever further down the hierarchy of rules until it actually reached a point where it could try and match a terminal (against the first available token). If the terminal and token matched, the parser would move back up the hierarchy looking for the next bit of the enclosing production, and following that down until the next token was matched. If all the parts of the production were matched, the parser would move back up, matching higher and higher rules, until it succeeded in matching the top-most rule completely.

If at some point a token fails to match, then the parser would work back up the tree, looking for an alternative way of matching the original rule (i.e. via a different production of the same rule). If all alternatives eventually fail, then the parse as a whole fails.

In other words, a top-down parser converts the grammar into a tree (or more generally, a graph), which it then traverses in a top-to-bottom, left-to-right order. If a particular branch of the tree fails to match, then the parser backs up and tries the next possible branch. If all branches fail, then the parse itself fails. In this approach, embedded actions represent leaves on the tree, and are executed if the traversal ever reaches them.

The top-down "tree" for the diff-reversing grammar looks like Figure 2.

Figure 2: top-down tree

THE DESCENT OF RECDESCENT

Tools for building bottom-up parsers have a long and glorious history, dominated by the yacc program developed by Stephen Johnson at Bell Labs in the mid 1970's. Yacc accepts a grammar very similar to those shown above and generates C code, which can then be compiled into a function that acts as parser for the language the grammar describes.

Rick Ohnenus adapted a modern variant of yacc to create perl-byacc, which generates Perl code directly. Later, Jake Donham developed a set of patches allowing perl-byacc to generate parser objects (rather than parsing functions).

Recently, Fran&ccecil;ois Desarmenien went one better and implemented a complete object oriented yacc-like parser generator, called Parse::Yapp, directly in Perl (it also generates its parsers as Perl code).

Top-down parsing has been less popular, but one prominent tool which uses this approach is PCCTS, developed as a PhD research project by Terence Parr. PCCTS provides much more power and flexibility than the yacc family, but generates C++, not Perl code.

John Wiegley's libparse bundle is an extremely comprehensive and ambitious parsing system that does generate Perl code. It is still in alpha release, but can already generate top-down parsers, at least on NT machines. Finally, my own Parse::RecDescent module - the topic of the rest of this article - creates very flexible top-down parsers in Perl. Listing 1 provides a comparison of some of these tools.

The Journey of 1000 Miles...

Most parser generators involve an unpleasantly complicated ritual. First, you create a lexer to split up your text into labelled tokens. That usually involves hand-crafting some code or using a lexer generator such as lex, flex, or DLG. Next, you create your grammar and feed to a parser generator. The parser generator spits out some more code: your parser. Then you write your program, importing both the lexer and parser code you built previously. You hook the three bits together and voilá! Typically that looks something like Figure 3.

Figure 3: Parser Generator

Of course, this multi-stage, multi-task, multi-file, multi-tool approach quickly becomes a multi-lobe headache.

ONE STEP

Building a parser with Parse::RecDescent looks like this:

Figure 4: Parser with Parse::RecDescent

The entire parsing program is specified in a single file of Perl code. To build a parser, you create a new Parse::RecDescent object, passing it the required grammar. The new object provides methods which you can then use to parse a string. For example, here is the complete diff reverser implemented using Parse::RecDescent:

#!/usr/bin/perl -w

use Parse::RecDescent;

my $grammar = q{
  DiffOutput:Command(s) /\Z/
  
  Command:
     LineNum 'a' LineRange RightLine(s)
       { print "$item[3]d$item[1]\n";
         print map {s/>/</; $_} @{$item[4]};
       }
   | LineRange 'd' LineNum LeftLine(s)
       { print "$item[3]a$item[1]\n";
         print map {s/</>/; $_} @{$item[4]};
       }
   | LineRange 'c' LineRange
     LeftLine(s) "---\n" RightLine(s)
       { print "$item[3]c$item[1]\n";
         print map {s/>/</; $_} @{$item[6]};
         print $item[5];
         print map {s/</>/; $_} @{$item[4]};}
   | <error: Invalid diff(1) command!>
   
LineRange:
   LineNum ',' LineNum
     { join '',@item[1..3] }
   | LineNum
   
LineNum: /\d+/
LeftLine: /<.*\n/
RightLine: />.*\n/
};
my $parser = new Parse::RecDescent($grammar);

undef $/;
my $text = <STDIN>;

$parser->DiffOutput($text);

The structure of this program is straightforward. First you specify the grammar as a string ($grammar). Next you create a new Parse::RecDescent object, passing $grammar as an argument to new(). Assuming the grammar was okay (and in this case it was), new() returns an object, which has one method for each rule in the grammar.

You then read the entire STDIN into another string ($text) and pass that text to the parser object's DiffOutput method, causing the parser to attempt to match the string against that rule.

As the match proceeds, the various actions specified for each rule will be executed. If the text is successfully parsed, the actions which get fired off along the way will cause the program to print a "diff-reversed" version of its input.

Scalpel... Retractor... Forceps

Let's dissect the program a little. First, observe that the grammar passed to Parse::RecDescent::new() is very much like the one in the examples above, except that it uses a colon to define new rules, instead of an arrow.

Note too that the grammar constitutes the vast bulk of the program. In fact, the rest of the code could be reduced to a single line:

Parse::RecDescent::new($grammar)->DiffOutput(join '',<>);

HANDLING ITEMS

Another minor difference is that Parse::RecDescent uses an array (@item) to represent matched items, instead of individual scalars ($item1, $item2, etc.) The automagical behavior of this @item array bears a little more explanation. When a subrule in a production matches during a parse, that match returns the scalar value of the last item matched by the subrule. That returned value is then assigned to the corresponding element of @item in the production which requested the subrule match.

That means that if a rule consists of a single item (for example, LineNum: /\d+/) then the substring matched by that regular expression is returned up the parse tree when the rule succeeds. On the other hand, if a production consists of several items in sequence (e.g. LineRange: LineNum ',' LineNum), then only the value of the second LineNum subrule would be returned.

It's a reasonable default, but if it's not the behavior you want, you need to add an action as the last item (e.g. {join '', @item[1..3]}), to ensure that the appropriate value is returned for use higher up in the grammar.

HANDLING REPEATED ITEMS

Another special case to keep in mind: if you specify a repeated subrule in a production (like Command(s) or LeftLine(s) ), the scalar value that comes back to @item is a reference to an array of values, one for each match. That's why the print statement for the 'c' command prints @{$item[4]} (i.e. the actual array of values matched by LeftLine(s) ), instead of just $item[4].

By the way, the various map operations applied to these arrays simply reverse the angle brackets, as diff requires.

NO LEXER

The other major difference between the Parse::RecDescent version of the grammar and those shown earlier is that, where the previous grammars had token names like DIGITS, SEPARATOR, A_CHAR, and EOF as its terminals, this version has quoted strings ('a' instead of A_CHAR, '---\n' instead of SEPARATOR), or regular expressions (/\d+/ instead of DIGITS, /\Z/ instead of EOF).

This is because, unlike nearly all other parser generators, Parse::RecDescent creates a parser that doesn't require a separate lexer subroutine to chew its input into predigested pieces.

Instead, when a Parse::RecDescent parser reaches a quoted string or regular expression in one of its productions, it simply matches the specified text or pattern against its original input string. This is known as "on-the-fly" or "context-sensitive" tokenization, and is a more flexible strategy than the usual pretokenization, because it allows different productions to interpret the same input differently.( A well known case that cries out for such context-sensitive lexing is in the grammar for C++, where nested templates (such as List<List<int>>) routinely generate syntax errors because the pretokenizing lexer classifies the trailing >> as a single "right shift" token. A Parse::RecDescent parser would have no such difficulties, should it ever demean itself to parse something as ugly as C++.)

ERROR HANDLING

A final addition to the Parse::RecDescent version of the grammar is the fourth production of the Command rule:

<error: Invalid diff(1) command!>

This is not a specification of what to match, but rather a directive that tells the parser what to do if it reaches that point without having matched anything else.

This particular directive tells the parser to generate the error message Invalid diff(1) command and then fail. That failure will be fatal, because it will stop the repeated Command subrule from matching any further, and will cause the following /\Z/ terminal to not find the end of the input.

Surely you can't be serious?

Grammar-based parsing certainly isn't restricted to processing rigidly defined data formats such as diff produces. Here's another example that's about as different as you can get. It's the classic "Who's on first?" routine, as performed by a pair of Parse::RecDescent parsers named $abbott and $costello (see Listing 2).

Each of the parsers encodes a particular world-view, which is a combination of interpretations of sentences it is parsing, together with some text generation logic that allows it to respond appropriately to those interpretations.

The Abbott grammar treats each line as either a statement that requires confirmation (a ConfirmationRequest), or as a name or position enquiry (a BaseRequest or NameRequest), for which the appropriate fielding position or player name should be supplied.

In contrast, the Costello grammar treats each parsed line as either a rephrasing of a previous question (a Question) that it tries to confirm, or as an ambiguous statement (UnclearReferent) that it tries to clarify, or as an inexplicable utterance (a NonSequitur) that derails the entire dialogue and forces it to start the conversation from scratch.

The humor resides in the fact that, although the responses provided by each parser are entirely consistent (within each grammar's world-view), they mesh seamlessly into inescapable mutual incomprehension.

Making Conversation

This example illustrates some additional features of the RecDescent package: parser return values, object oriented parsing, explicit rejection, "inlined" subrules, optional and specified-repetition subrules, and rule lookaheads.

PARSER RETURN VALUES

Just as each subrule in a Parse::RecDescent parser evaluates to a single scalar value, so too the rule through which parsing is initiated (e.g. $costello->Interpretation) returns a scalar value. Specifically, it returns the value assigned to the Interpretation rule within the grammar.

The program uses this feature to extract responses from Abbott and Costello. Each response is generated within some lower-level subrule of each grammar by embedded actions that either analyze whether the recognized statement is true (ConfirmationRequest) or look up the information requested (NameRequest) or, in desperation, choose a random (but relevant) response (UnclearReferent). The generated response bubbles back up the parse tree, to be returned by the top-most rule. The while loop then feeds that response into the other parser, whose response is in turn fed back to the first parser. The ensuing alternation generates the dialogue.

Although it can't happen in either of these parsers, the parser returns undef if the initial rule fails to match. Hence a common idiom in Parse::RecDescent parsers is:

$result = $parser->InitialRule($text)
      or die "Bad input!\n";

OBJECT ORIENTED PARSING

Parsers generated with Parse::RecDescent are Perl objects (specifically, blessed hash references), and the Costello grammar takes advantage of that fact in two ways.

First, it makes use of Parse::RecDescent::choose() function (which is not part of Parse:: RecDescent, but was added by the main program), to randomly select from a range of possible responses.

Second, the Costello grammar adds an extra member to the $costello parser object: $thisparser->{prev}. It uses that to cache information between parses (more on that in a moment).

The $thisparser variable is automatically generated by Parse::RecDescent has abilities far beyond those described so far. Some of its more interesting abilities follow. Parse::RecDescent, and is the way in which actions in a grammar can refer to the parser which is executing them. It's much like the variable $self used when writing method subroutines for Perl classes.

EXPLICIT REJECTION

You can explicitly cause a production to reject itself under specified conditions with the <reject:...> directive. This directive takes a Perl expression and evaluates it. If the expression is true, then the directive immediately causes the current production to fail. If the expression is false, the directive is ignored and matching continues in the same production.

The Costello grammar uses this facility in order to avoid repeating its responses. Its top-level rule:

Interpretation:
      Meaning <reject:$item[1] eq $thisparser->{prev}>
          { $thisparser->{prev} = $item[1] }
  | { choose(@::try_again) }

first tries to match a Meaning, but then rejects the match if the returned response is the same as the previously-cached value (in $thisparser->{prev}). If the returned response isn't rejected, the following action caches it for next time. And since the action is the last item of the production, it returns that value on behalf of the entire Interpretation rule.

On the other hand, if the response was rejected, then the first production immediately fails, and the second production is tried. That production immediately chooses a random response from the list in @try_again and returns it as the value of the entire Interpretation rule. Rejection can even take into account factors outside the grammar. For instance, a rule like this:

Filename:
      /\w+/ <reject: !-r $item[1]>

allows your grammar to recognize an identifier as a filename, but only if the string is the name of a readable file.

"INLINED" SUBRULES

Alternative subproductions can be specified in parentheses within a single production. For example, the subproduction:

NonSequitur:
 ( "Yes" | 'Certainly' | /that's correct/i )
 { choose("$item[1], who?", "What?", @::try_again) }

lets you perform the same action for Yes, Certainly, or anything matching /that's correct/i.

OPTIONAL AND SPECIFIED-REPETITION SUBRULES

Just as subrules in a production can be made repeatable with a trailing (s), they can also be made optional with a trailing (?), or optional and repeatable with a trailing (s?), or even made to repeat an exact number of times if a numeric range is specified in the trailing parentheses.

For example, the rule:

BaseRequest: Preface(s?) Name /(is)?/

specifies that a name may be preceded by zero or more prefaces. Alternatively, you could set a limit with an explicit range:

BaseRequest: Preface(0..10) Name /(is)?/

RULE LOOKAHEADS

Parse::RecDescent also provides a mechanism similar to zero-length lookaheads in regular expressions, better known as (?= ) and (?! ). Parse::RecDescent lookaheads act just like other production items when matching, except that they do not "consume" the substrings they match. That means the next item encountered after a lookahead item will attempt to match at the same starting point as that lookahead.

Negative lookaheads are specified with a ...! prefix and are typically used as guards to prevent repeated subrules from eating more than they should. The Abbott grammar, for example, uses a negative lookahead to make sure that the Preface subrule doesn't consume a significant name:

Preface: ...!Name /\S*/

The ...! preceding Name means that, to match a Preface, the string being parsed must not have a Name before the specified pattern.

Positive lookaheads are specified with a ... prefix, and are also useful for ensuring that repetitions don't get too greedy. They aren't used in the Abbott or Costello grammars, but they're handy in situations like this:

Move: 'mv' File ExtraFile(s) Destination
   | 'mv' File File
   
ExtraFile: File ...File

Destination: File

Here, an ExtraFile only matches a File followed by another File. That means that exactly one file will be left after the ExtraFile subrule has matched repeatedly, ensuring that the trailing Destination item will match. Without the trailing lookahead, ExtraFile would match all the files, leaving nothing for Destination.

It Does What??!

Parse::RecDescent has abilities far beyond those describwed so far. Some of it's more interseting abilities follow.

AUTOMATED ERROR REPORTING

The <error: message> directive used in the diff-reverser prints a fixed message when triggered. Parse::RecDescent provides a related directive, <error>, which automatically generates an error message, based on the specific context in which it's triggered. Consider this rule:

Command:
LineNum 'a' LineRange RightLine(s)
| LineRange 'd' LineNum LeftLine(s)
| LineRange 'c' LineRange
  LeftLine(s) "---\n" RightLine(s)
| <error> <error>

If none of the preceding productions had matched anything, it would automatically generate the message:

ERROR (line 10): Invalid Command:
           was expecting LineNum or LineRange

when fed invalid diff output at line 10.

INTEGRATED TRACING FACILITIES

Tracking down glitches when a large grammar is applied to a long string can be tricky. Parse::RecDescent makes it easier with extensive error checking, and a tracing mode that provides copious information on parser construction and execution.

As the parser is constructed from a grammar, Parse::RecDescent automatically diagnoses common problems, such as invalid regular expressions used as terminals, invalid Perl code in actions, incorrect use of lookaheads, unrecognizable elements in the grammar specification, unreachable rule components, the use of undefined rules as production items, and cases where greedy repetition behavior will almost certainly cause the failure of a production.

Other, less critical problems are reported if the variable $::RD_WARN is defined. Extra information and hints for remedying problems can be also supplied by defining the variable $::RD_HINT.

Another variable, $::RD_TRACE, can be used to trace the parser's construction and execution. If $::RD_TRACE is defined prior to parser construction, Parse::RecDescent reports (via STDERR) on the construction progress, indicating how each element of the grammar was interpreted. This feature can be especially useful for tracking down mismatched quotes or parentheses, and other subtle bugs in the grammar specification.

These debugging facilities are designed for unobtrusiveness and ease of use. They can be invoked without making any changes to the original grammar, which reduces the incidents of heisenbugs. Likewise, the use of main package variables as switches makes it easy to debug from the command line, without hacking the code at all (e.g. perl -s parserprog.pl -RD_WARN -RD_TRACE ).

POSITION INFORMATION WITHIN ACTIONS

You can refer to $thisline, thiscolumn, and $thisoffset in any action or directive. These store the current line number, column number, and offset from the start of the input data. For example:

Sequence: BaseSequence(s)

BaseSequence: /[ACGT]+/
   | <error: alien DNA detected at byte
     $thisoffset $thisoffset!>

PARSE TREE PRUNING

Sometimes you know more about the structure of the input than can easily be expressed in a grammar. Such knowledge can be used to prune unpromising productions out of the parse tree.

On certain inputs a production may be able to commit itself at some point, effectively saying: "At this point, we know enough that if the current production fails, all the rest must also fail." For example, in this rule:

Command: 'find' <commit> <commit> Filename
       | 'delete' <commit> <commit> Filename
       | 'move' Filename 'to' Filename

the first production can commit itself after matching find, since the other two productions have no hope of matching in that case.

ARGUMENT PASSING AND GENERIC RULES

Any non-terminal in a production can be passed arguments in a pair of square brackets immediately after the subrule name. Any such arguments are then available (in the array @arg) within the corresponding subrule.

Better still, since double-quoted literal terminals and regular expression terminals are both interpolated during a parse (in the normal Perl manner), this means that a rule can be parameterized according to the data being processed. For example, the rules:

Command: Keyword Statement(s)
         End[$item[1] $item[1]]

Keyword: 'if' | 'while' | 'for'

End: "end $arg[0] $arg[0]"

ensure that command keywords are always matched by a corresponding "end keyword" delimiter, because the Command rule passes the matched keyword ($item[1]) to the End subrule, which then interpolates it into a string terminal (end $arg[0] ) to be matched.

It's even possible to interpolate a subrule within a production by using the <matchrule:...> directive:

Command: Keyword Statement(s)
         <matchrule:"End_$item[1]"> <matchrule:"End_$item[1]">
Keyword: 'if' | 'while' | 'for'
End_if: 'end if'
End_while: 'end while'
End_for: 'end for'

In this version the <matchrule:"End_$item[1]"> directive interpolates the text matched by Keyword into a new subrule (either End_if, End_while, or End_for) and then attempts to match it.

DEFERRED ACTIONS

One of the limitations of embedding actions in rules is that they are executed as the rule is matched, even if some higher level rule eventually fails. At best, that means wasted effort; at worst, it leads to incorrect behavior, especially if the actions have side effects such as printing.

For example, if the diff-reversing grammar were to encounter incorrect input half way through its parse, it would have already printed some of the reversed commands when it finds the error, reports it, and stops. That's just plain messy.

Fortunately, parser actions in Parse::RecDescent can be deferred until the entire parse has succeeded. Such deferred actions are specified using the <defer:...> directive around curly braces.

Here's the diff-reversing grammar from the first example, with deferred actions instead:

DiffOutput: Command(s) /\Z/

Command: LineNum 'a' LineRange RightLine(s)
  <defer: <defer:
       { print "$item[3]d$item[1]\n";
         print map {s/>/</; $_} @{$item[4]}; } > >
| LineRange 'd' LineNum LeftLine(s)
  <defer: <defer:
       { print "$item[3]a$item[1]\n";
         print map {s/</>/; $_} @{$item[4]}; } > >
| LineRange 'c' LineRange
LeftLine(s) "---\n" RightLine(s)
  <defer: <defer:
       { print "$item[3]c$item[1]\n";
         print map {s/>/</; $_} @{$item[6]};
         print $item[5];
         print map {s/</>/; $_} @{$item[4]}; } > >
| <error: Invalid diff(1) command!>

LineNum: /\d+/

LineRange: LineNum ',' LineNum
           { join '',@item[1..3] }
         | LineNum
		 
LeftLine: /<.*\n/

RightLine: />.*\n/

Now the various calls to print are only executed once the entire input string is correctly parsed. So the grammar either reverses the entire diff or prints a single error message.

As each deferred action is encountered during the parse, it is squirreled away and only executed when the entire grammar succeeds. More importantly, deferred actions are only executed if they were part of a subrule that succeeded during the parse. That means that you can defer actions in many alternative productions, and Parse::RecDescent will only invoke those that belonged to subrules contributing to the final successful parse.

Grammers and Metagrammers: Listing 3.

EXTENSIBLE GRAMMARS

This is probably the weirdest science in Parse::RecDescent. Unlike other parser generators, you can extend or replace entire rules in a parser at run time. The bit that usually freaks yacc devotees out is that you can even change the grammar while the parser is parsing with it!

For example, suppose you are parsing a programming language that allows you to define new type names, and that those new type names are subsequently invalid as normal identifiers. In Parse::RecDescent you could do it like this:

TypeName: 'number' | 'text' | 'boolean'

TypeDefn: 'type' Identifier 'is' TypeName
  { $thisparser->Extend("TypeName: '$item[2]'") }

Identifier: ...!TypeName /[a-z]\w*/i

Traditional parsers for languages like this require you to send semantic feedback to the lexer, to inform it that certain identifiers are now to be treated as type names instead. Parse::RecDescent allows you to do the same job using the much simpler notion of lexical feedback, by making the parser update itself when a type definition effectively modifies the language's syntax.

Do What, John?

So what do people actually do with Parse::RecDescent and the other parser generators available? Here's a list of typical applications:

Building natural language parsers for user interfaces, text summarization, or searching;
Extracting data from structured reports (like those generated by system monitors, execution profilers, or traffic analyzers);
Implementing command languages, query languages, configuration file processors, or even full programming language emulators in Perl;
Creating parsers for special purpose data specification languages (scene description languages for computer graphics, for example)
Constructing analyzers for large structured data sets (such as genetics and biology databases)
Building decommenters, syntax checkers, or pretty-printers for various languages (sorry, there are no available Parse::RecDescent grammars for Perl, C, or C++ yet!)
Building data translators (Pod to XML, XML to RTF, RTF to /dev/null)
Processing of complex comma-separated values (i.e multiple mixed formats, handling unquoted commas, etc.)

Anywhere that you can specify the structure of some data with a grammar, you can build a recognizer for that data using Parse::RecDescent.

Whither Parse::RecDescent?

Parse::RecDescent is powerful, but far from perfect. Some of its limitations stem from the type of parsers it builds, others from its own ragged evolution.

LEFT-RECURSE YE NOT!

The most significant limitation is, unfortunately, inherent to Parse::RecDescent's nature: like all top-down parsers, it doesn't do left recursion. That means it can't build parsers for certain perfectly reasonable grammars. For instance, a fairly typical way of specifying the left associativity of addition is to write a rule like this:

Addition: Addition '+' Term | Term

which ensures that terms are eventually grouped to the left.

Bottom-up parsers like yacc have no trouble unravelling left-recursive rules, but top-down parsers can't handle them, because the first thing they would have to do when matching an Addition is to match an Addition, which in turn requires them to first match an Addition, and so on.

Parse::RecDescent can at least diagnose these cases for you (they can be much more subtle than the example above), but it will probably never be able to build parsers for them. Fortunately that isn't a big problem, since you can achieve the same effect with this:

Addition: (Term '+')(s?) Term

THE NEED...FOR SPEED

Parse::RecDescent parsers aren't always very fast, especially with large grammars or long inputs. Fortunately, that's not an immutable property of the universe, and version 2.00 (which should be in the CPAN for Christmas) will feature a considerable speed boost.

COMING ATTRACTIONS

Other features slated for future Parse::RecDescent releases include:

probabilistic matching, where each production in a rule will compute its own probability of success as it matches, and the highest probability production for each rule will be the one accepted (rather than just the first which succeeds);
tokenized input, so you'll be able to use a preliminary lexer if you choose to (which may significantly speed up some parsing tasks where context-sensitivity isn't an issue);
automatic backtracking within productions with repetitions, which will eliminate the problems of overly greedy matches "starving" later items;
non-greedy repetitions, which will match a minimal number of times, rather than a maximal number (just like non-greedy qualifiers in regular expressions);
an <autocommit> directive that tells the parser to analyze a grammar and automatically insert <commit> and <uncommit> directives wherever they will do some good.

"Would you like to know more?"

Here's where to find more information on all of the above:

The man(1) of descent. Charles Darwin published The Descent of Man in 1871, becoming the first scientist to publicly suggest that you could arrange your gran'ma in a Past Tree. The complete original text is available online from http://www.literature.org.

Sorry? You wanted more on parsing?

Well, for an FMTYEWTK introduction to the field, it's hard to go past the dragon book: Aho, Sethi & Ullman, Compilers: Principles, Techniques, and Tools, from Addison-Wesley. Or for a less rigorous, but more readable approach, try The Art of Compiler Design, by Pittman & Peters (Prentice Hall).

A few online resources are also worth looking at. Two good places to start are the catalog of parsing tools at: http://www.first.gmd.de/cogent/catalog and the even larger repository of tools at: http://www.km-cd.com.dragon_fodder/html/ToolBuildingBNF.html.

If you're specifically interested in using yacc and lex, the Mothers of All Parser Generators, O'Reilly publishes the excellent lex & yacc (second edition).

Parse::RecDescent, Parse::Yapp, and libparse are each available from the CPAN, while PCCTS is available from its homepage at http://dynamo.ecn.purdue.edu/~hankd/PCCTS/.

Finally, if you want to know more about Parse::RecDescent, there's a 40 page POD manual that comes with the distribution. You can also point your fevered browser at: http://theory.uwinnipeg.ca/CPAN/data/Parse-RecDescent/Parse/RecDescent.html.

Deo (Et Al) Gratias

My deepest gratitude to Nathan Torkington for suggesting this article, and for his heroic efforts cleaning up the Augean stables of my prolixity. No good deed ever goes unpunished. Thanks also to the (foreign) legion of Parse::RecDescent testers, users, and other guinea-pigs for their feedback, complaints and wild-and-crazy suggestions. In particular, Helmut Jarausch, Theo Petersen, and Mark Holloman have a lot to answer for.

__END__

Damian Conway (damian@csse.monash.edu.au) lives in a former penal colony where he leads young minds to the Dark Side (by teaching them object oriented software engineering in C++). He is also guilty of the Text::Balanced, Lingua::EN::Inflect, and Getopt::Declare modules. Those last two tied for the 1998 Larry Wall Award for Practical Utility, which their author views as a damning evidence of endemic drug use in Perl's upper echelons.

TABLE OF CONTENTS