Converting C to English with Perl - The Perl Journal, Fall 2000

Omri Schwartz

Resources
c2eng..............................................http://www.mit.edu/~ocschwar/decss2.pl

I've been loosely following the recent litigation over the DeCSS program. DeCSS is a C program floating around the Internet that decrypts data encoded in CSS (Content Scrambling System), the encryption scheme used by DVDs. Sites posting the source code have been the victim of legal action, sometimes under the justification that source code is not speech, and therefore isn't protected by the First Amendment in the U.S.

Should source code be regarded as a form of protected expression? I think it should, and so I wrote a program called c2eng that takes ANSI C code and translates it to grammatically correct and descriptive English sentences. For instance, that means taking the C preprocessing statement #include <math.h> and turning it into "This program makes use of the system file math.h.", and turning #define Pi 3.14159265358979 into "Note: we define Pi to mean '3.14159265358979'". With c2eng, the global declaration int I,J,K; becomes "Specifying the type integer, allocate the variables 'I', 'J', and 'K'." Naturally, the tool I chose for this task is Perl.

The first thing I did was to ask around Usenet for suggestions about how to go about this project. Several folks pointed me in the direction of Damian Conway's Parse::RecDescent module. After downloading the module, I read more about a similar utility called yacc, which takes a description of a language (called a grammar) and generates a program that can parse (or "understand", in a weak sense) texts conforming to that grammar. You might use yacc to create a compiler for a language of your own design; in fact, perl itself uses a heavily-modified variant of yacc to parse Perl programs.

In this article, I'll explain how I arrived at a fully-functioning C to English translator. You'll get the most out of it if you already know the basics of parsing; see Damian Conway's article on Parse::RecDescent in TPJ #12 for a thorough introduction to parsing in general, and his module in particular.

Why Parse::RecDescent?

Parse::RecDescent offers several advantages over other parsers for this task. The primary advantage is that a typical entry in a grammar looks like this:

	  Rule : Subrule1 Subrule2

But with Parse::RecDescent, you can include arbitrary bits of Perl code to be executed during the parse:

	  Rule : Subrule1 Subrule2
	        { $return = $item{subrule2} . $item{subrule1}; }

This lets us assemble a string ($return) and return it to whatever superrule called Rule. This was exactly the flexibility that I needed. It let me store English strings like "specified as the type" in $return and pass them from rule to rule.

So what's the top-level rule, the one that actually does the printing? The first construct you find in a C program can be a global variable declaration, a function prototype or definition, a comment, or a preprocessor directive. Our top-level rule looks like this:

  startrule : (
	preproc {print $item{preproc};}
	| comment {print $item{comment};}
	| global_var_declaration {print $item{global_var_declaration};} 
	| function_definition {print $item{function_definition};}
	| function_prototype {print $item{function_prototype};}
  )(s)

This tells Parse::RecDescent to collect the return string from any of those constructs and print the result to STDOUT. The parentheses around the subrules indicate that we might find more than one startrule in the program. (Usually, we will.)

Note that startrule is the only rule allowed to print anything. As c2eng meanders through a C program trying to figure out how the component characters are divided into higher-order terms, expressions, and statements, it will try out multiple interpretations as it moves through the program, only to realize later that an interpretation is wrong. At that point it backs up and try another interpretation -- but once you've printed something you can't undo it, so we have to constrain the printing to this top-level rule.

As I wrote c2eng, I posted several questions to the comp.lang.perl.misc newsgroup and received assistance from Damian and Randal Schwartz. (And by "assistance", I mean being walked through many problems step-by-step.) Damian sent me a C grammar that had been written for Parse::RecDescent, which I used as the basis for my work.

Math Is English, Too

In 1557, a scholar named Robert Recorde wrote a treatise on arithmetic containing this famous quote:

To avoide the tediouse repetition of these woordes: is
equalle to: I will settle as I doe often in woorke use,
a paire of paralleles, or gemowe [twin] lines of one
lengthe: =, bicause noe .2. thynges, can be moare
equalle.

Yes, the universal language of our mathematical notation began with the impatience of a man writing an entire book by hand, in a dank unlit room, with inkwell and quill. If our mathematical symbols are simply contractions for normal English woordes, we should be able to take any equation and translate it into a grammatically correct sentence.

Once I figured out how to descend recursively through a grammar and translate algebraic equations into mechanical English prose, I just needed to put in the Perl code to construct it. Enter the %item and $return variables, which c2eng uses to amass the English output. At the end of every rule we can have what Parse::RecDescent calls an "action"; in most of our rules, the action is to assign a value to $return. For instance, here's a rule involved in parsing mathematical expressions:

  rel_add_mul_shift_expression : 
	  <leftop: cast_expression rel_mul_add_ex_op cast_expression>
  {
	  $return = join('',@{$item[1]}); 
  }

Parse::RecDescent provides @item and %item variables that let a rule find out what a subrule returned. The act of parsing is sufficient for determining which characters combine into expressions, but the extra meaning of those expressions -- their English representation -- has to be passed upward to superrules through variables like $return that gather the meaning of the current rule and store it for the superrule.

That's all very abstract. What it means in this case is that the rel_add_mul_shift_expression rule, which parses a mathematical expression like a + b, is able to examine the English strings associated with a and b and join them together. It turns this:

  prod[2][1] = a[2][1]*b[1][1]+a[2][2]*b[2][1];

to this:

Assign to array prod's element at address (2,1) the value "array a's
element at address (2,1) times array b's element at address (1,1) plus
array a's element at address (2,2) times array b's element at address
(2,1)".

It's not poetry. But it is, at least, pronounceable.

There's much more to say about the difficulties of parsing C with an eye toward generating readable English output. For instance, the statement bar = baz; can be translated as "Assign to 'bar' the value of 'baz'." That suggests a certain way of writing the rule that handles assignment statements. But an overly simplistic implementation might turn a statement like foo = bar = baz; into "Assign to 'foo' the value of assign to 'bar' the value of 'baz'", which doesn't quite roll off the tongue. Instead, c2eng performs a slightly more rigorous analysis, yielding the more palatable Assign to 'foo' the value 'bar' which is assigned to be 'baz'.

In some situations, we need to know what comes after a rule we've parsed. For example, when we use - as a unary operator (e.g. foo = -bar or baz = -1), we say "minus" if what follows is a constant, and "negative" if it's not. Parse::RecDescent lets us peek ahead with the ... construct:

unary_operator : [ other things] 
	| '-' ...constant {$return  = 'negative ';}
	| '-' {$return = 'minus ';}

This translates -1 to "negative one" while -x is translated to "minus x".

Simple mathematical expressions aren't that hard to express in English: we just need to imitate a very boring math teacher. In comparison, flow control statements and function definitions are easy to verbalize. It's easy to translate foo(bar) to "Perform the function 'foo' as applied to the argument 'bar'." What's hard, however, are parentheses.

Parentheses are critical in mathematics, but they unfortunately don't translate very well into English because of their stackable nature. We can handle stacks of concepts when we talk or read, as long as the stack doesn't get too deep. (And "too deep" means more than about six levels. (For more about this, read Douglas Hofstadter's "Gödel, Escher, Bach, an Eternal Golden Braid". (Basic Books, 1979.)))

When a math teacher says something like "the quantity x plus y, all over z...", the phrases "the quantity" and "all" signal to the listener that the stack depth is changing. In spoken conversations, hand gestures are sometimes used to indicate particular depths as well.

Which leaves us in a quandary. What do we do with parenthetical expressions in C code? We could replace each opening parenthesis with "the quantity" and each closing parenthesis with "now", but what about nested parentheses? Mechanical repetitions of "the quantity" or "now" would convey the meaning of the underlying C, but it wouldn't sound pleasant since we'd never talk that way in real life, even if we were trying to convey a complicated mathematical expression. The parenthesis-handling portion of c2eng is displayed in Listing 1.

When c2eng encounters a set of nested parentheses, it displays a phrase like "the 22-layered parenthetical expression". If it encounters multiple parentheses in a row, it says so: "(now drop 22 layers of context)". Since phrases like these interrupt the flow of communication, c2eng puts them in, yes, parentheses.

c2eng works, and can translate the vast majority of C programs into readable, if boring, English. I'm continuing to refine the program to improve the output, but the proof of concept is there. There are still many issues to be worked out; for instance, when a C compiler parses a C program, it immediately strips out comments and runs the C preprocessor. c2eng doesn't, because I want the English output to contain all of the information in the C program. This is important, because we want our translation to be perfectly reversible, converting from C to English, and then from English back into C. Perhaps this will help courts understand that to code is to talk and to talk is to code.

_ _END_ _

TABLE OF CONTENTS