
Guts: Lexical Analysis

Chip Salzenberg

Requirements
Just Perl

"In tonight's episode, recurring character Chip Salzenberg goes spelunking for archaeological evidence of pre-Perl camels, only to be captured by the Brotherhood of the Horned Llama. But first, in a touching farewell scene, he tries to explain to his fellow archaeologicians just what is the deal with Perl's syntax..."

FIRST, WE STRIP ALL THE WHITESPACE

It's been said that "the only program that can parse Perl is perl." If you've ever tried to get a smart editor like Emacs to properly indent your Perl program, you'll probably agree. And while Ilya Zakharevich has made great strides with cperl-mode.el, Perl's syntax is still more complex and exception-ridden than most. Now, ask yourself: Given that Perl's syntax is riddled with oddities, exceptions, and attempts to do what you mean instead of what you say, what bizarre twists and turns must a program take to understand it?

Well, you're about to find out.

I ANALYZED TOKE.C BUT I DIDN'T INHALE

In our last episode, we toured the major subsystems of perl 5. One subsystem is "Lexical Analysis." It's the only part of perl 5 that sees your actual source code in all its glory (?), so speak well of it when it's in earshot.

Lexical analysis consists of turning a source file - a single unbroken stream of characters - into discrete units, called tokens. (That's why lexical analysis is often called tokenizing.) Tokens are the fundamental units of a programming language. Typical tokens are identifiers like foo, literal strings like 'bar', and operator names or symbols like print or +.

The next stage after lexical analysis, called parsing, takes those tokens and, based on their context, figures out what they mean. After all, foo might be a subroutine name, a filehandle, or even a variable name if it follows a dollar sign. But the full glory of parsing is a subject for another day (or week).

Lexical analysis of Perl is a seriously hairy job, and toke.c contains some seriously hairy code. (Like the rest of Perl, the tokenizer is written in C.) You'd probably find an exhaustive treatment of its ins and outs to be, well, exhausting. I certainly would. Fortunately for all of us, there is not enough room within the margins of this magazine for that. Instead, we'll be hitting some of the high points - specific features that are representative or revealing. Follow along at home if you have the strength. Otherwise, just remember that when you're done reading this column, toke.c is there. Waiting for you. (This column is based on toke.c from 5.005-beta-1, since 5.005 wasn't quite released when this column's deadline arrived. Well, actually, beta-1 wasn't released by my deadline either. But let's not get picky about timing - that's Jon's job.)

TOKEN FOR GRANTED

The details of lexical analysis vary from language to language, but one constant endures: its job is to turn text into tokens for feeding into the parser. But how does the lexing code communicate those tokens to the parser? The answer may surprise you, because it's so simple.

The primary interface to the lexer is the function yylex. The parser calls yylex when it wants another token, and yylex obligingly returns the next token from the current Perl program. Eventually, yylex serves your entire Perl program up to the parser, one token at a time.
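
In outline, the conversation looks something like the following minimal sketch. (This is purely illustrative - the real generated parser in perly.c is a mass of tables and gotos, not a tidy loop.)

    /* Purely illustrative sketch of the parser's side of the
     * conversation; the real perly.c is machine-generated. */
    int yyparse(void)
    {
        int token;
        while ((token = yylex()) != 0) {   /* 0 means end of input */
            /* consult the parse tables; shift the token or
             * reduce a grammar rule */
        }
        return 0;   /* parsed the whole program */
    }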

yylex doesn't accept any parameters, so we must keep state information about the current Perl program in global variables. Now, this may seem like a strange and restrictive interface. Why not allow parameters, which would eliminate the need for nasty globals? And, for that matter, why use the odd name 'yylex'?

Because Perl's parser just has to have its way. Allow me to tell the not-really-so-sad tale: the parser is mechanically generated from the grammar in perly.y. Perl's build process uses a program called byacc (a variant of yacc, Yet Another Compiler Compiler) to generate a program that can parse that grammar. (You can find the generated parser in the files perly.h and perly.c bundled with the Perl distribution, but don't expect much; they're not intended for human consumption.)

Now, yacc is a wonderful tool and saves everyone a lot of grunt work - everyone, that is, except for the person writing the lexer. That's because the lexer's author has to put up with yacc's quirks, one of them being that yacc-generated parsers must get their tokens by calling a function called yylex with no parameters. This is what we engineers call a "tradeoff."

Another quirk of yacc-generated parsers is that they require tokens to be represented as unique small integers. Zero represents end of input. Integers between 1 and 255 usually represent the corresponding ASCII character; for instance, the token for an 'A' is 65, the ASCII code for 'A'. Larger integers (256 and up) represent more complex tokens, which needn't be single characters.

The complete list of tokens supported by the lexer is rather long and would be fairly confusing at this point. When you're ready, see perly.h or the top of perly.y. In the meantime, here's a taste:

WHILE
       loop control keyword: while or until

LSTOP
       list operator: print, sort, and others

DOTDOT
       range or list constructor: '..' or '...'

MULOP
       an operator with the same precedence as *: /, %, x, or * itself.
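
To make the small-integer scheme concrete, here's a hedged sketch of what such definitions look like. The actual numbers are assigned by byacc when it generates perly.h, so the values below are invented:

    /* Invented values, for illustration only; byacc assigns the
     * real numbers when it generates perly.h. */
    #define WHILE  300    /* while or until                 */
    #define LSTOP  301    /* list operator: print, sort ... */
    #define DOTDOT 302    /* .. or ...                      */
    #define MULOP  303    /* *, /, %, or x                  */

    /* Single characters represent themselves, so the token
     * for '+' is 43 and the token for ';' is 59. */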

You may have noticed by now that something is missing from the interface to yylex. It's all well and good for yylex to return the token WHILE when it sees the keyword while or until - it does this because the syntax for both is identical, even though the meanings differ. But when it's time to build the OP tree for a loop, it's vitally important for the parser to know the difference between them. How does the parser know which of those keywords triggered the WHILE token?

The answer is a second and key part of the interface between yylex and the parser: the global variable yylval. When tokens require additional information to fully describe them, yylex is required to put that information into the global variable yylval before returning. The type of the additional information varies from token to token: It may be as simple as a line number, or as complex as a large tree of Perl OPs. Therefore, yylval is a union of various data types, as specified with the %union directive in perly.y. The grammar in perly.y has %type directives that tell yacc which tokens are accompanied by additional data in yylval, and the types of those data.
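
In C terms, the declaration yacc builds from %union looks roughly like this sketch; the field names here are illustrative stand-ins, and the authoritative list is in perly.y itself:

    /* Sketch of the union behind yylval; field names are
     * illustrative, not copied from perly.y. */
    typedef union {
        I32   ival;     /* e.g. a line number or operator code */
        char *pval;     /* a string                            */
        OP   *opval;    /* a whole OP tree                     */
    } YYSTYPE;

    extern YYSTYPE yylval;   /* yylex fills this in before returning */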

So the job of yylex each time it's called is to figure out what the next token is, set yylval with any additional data appropriate for the token, and return the token.

Incidentally, perl has a catch-all token: THING. The lexer returns THING whenever it would prefer to just build an OP tree and hand it to the parser instead of returning tokens and letting the parser build the OP tree. If it weren't for yylval's ability to hold any data type (including an OP pointer), this convenient shortcut would not be possible.

THINK GLOBALLY, DECLARE LOCALLY

As we've already seen, the lexer has to keep state in global variables (seeing as how yacc requires yylex to take no parameters). You may be wondering just what kind of state it has to keep, and what variables are used to keep it. Well, I'm glad you're wondering that, because right here is an (incomplete) list of the global variables that the lexer uses to tokenize your Perl programs:

SV *linestr

The linestr variable holds the current input line from the script being compiled. The SV data type is one of the core data structures of perl 5. It holds a Scalar Value. Therefore, anything that is legal to put in a scalar variable in Perl can be stored in an SV. Here, however, the full power of the SV is not needed; linestr is never anything fancier than a simple string. Still, using an SV saves us the trouble of tracking memory allocation as the buffer grows and shrinks. (Perl's good at string manipulation, as you may have noticed.)

char *bufptr
char *bufend

The bufptr and bufend variables point into the buffer held by linestr - specifically, the current position and the next byte beyond the end of the current line. These are two of the most commonly used variables in the lexer.
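
Here's a hedged sketch of how they get used. SvPVX and SvCUR are genuine SV macros (the start of the string and its current length); the whitespace-skipping loop itself is invented for illustration:

    /* Illustrative use of bufptr and bufend; SvPVX and SvCUR
     * are real SV macros, the loop is invented. */
    char *bufptr = SvPVX(linestr);                   /* current position */
    char *bufend = SvPVX(linestr) + SvCUR(linestr);  /* one past the end */

    while (bufptr < bufend && isSPACE(*bufptr))
        bufptr++;                                    /* skip whitespace */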

FILE *rsfp

The file that the current Perl program is in. When the source code in linestr has been completely tokenized, it's refilled by reading more source code from this file.

line_t copline

The current input line number, incremented each time a newline is encountered.

enum expectation expect

The expect variable records the lexer's idea of what it is likely to see next. expect can be any of XOPERATOR, XTERM, XREF, XSTATE, XBLOCK, or XTERMBLOCK. This variable is critical to many of Perl's context-dependent syntax rules.

For example, curly braces might represent a naked block or an anonymous hash constructor. How does the lexer know the difference? By checking expect. When expect is XSTATE, the lexer is eXpecting a STATEment; thus, a left brace is taken as the start of a naked block. But when expect is XTERM, the lexer is eXpecting a TERM (part of an expression); thus, a left brace is interpreted as the beginning of an anonymous hash constructor. Likewise, an asterisk may be a multiplication symbol or the first character of a typeglob. With expect, the lexer sorts them all out correctly.
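
In sketch form (not the real control flow in yylex), the left-brace decision comes down to something like this:

    /* Illustrative only; the real yylex weighs more cases. */
    if (*bufptr == '{') {
        if (expect == XSTATE) {
            /* '{' opens a naked block */
        }
        else if (expect == XTERM) {
            /* '{' opens an anonymous hash constructor */
        }
    }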

Computer science purists will decry the cooperation between parser and lexer required to maintain a correct value for expect. The CS ideal is for the lexer to be able to work without getting a cheat sheet from the parser. Well, all we can say is: It's a fair cop, but society's to blame. Perl would be nigh on impossible to parse without this parser-lexer love-in, because it's not a traditionally pure CS language (i.e. written by a professor with an axe to grind).

U32 lex_state

You might expect that when the lexer sees a double-quoted string, it limits its efforts to finding the end of the string. (That's how most languages' lexers work.) And, in fact, the lexer includes a subroutine called scan_str that does just that.

But if that were all the lexer did with strings, then the parser would have to figure out all those tricky behaviors related to case shifting and variable interpolation. In particular, consider the interpolation of array and hash elements: their subscripts are full Perl expressions. It would be a Bad Idea to duplicate all of the lexer's logic in the parser just because the lexer didn't want to dig into a double-quoted string.

Fortunately, Larry did things another way. When the lexer encounters a double-quoted string, it digs right in. It actually goes so far as to return fabricated tokens to the parser, making the parser think that any fancy double-quoted string behaviors were spelled out the long way using Perl operators. For example, \Q$x\E$y is returned to the parser as though the programmer had written join('', quotemeta($x), $y). (That's why the operation of \Q is affected when you import a customized quotemeta function.)

The changing value of lex_state helps the lexer keep track of the different parts of fancy double-quoted strings and the different rules that apply to their various parts.

lex_inwhat

Perl plays the same games with regular expressions as it does with strings, creating the illusion of Perl operators to implement case modifiers and interpolation. In fact, the lexer still calls scan_str (via scan_pat) to grab the regex text.

However, there are some differences between the interpolation rules of regexes and strings. For example, the regex /(a$)/ doesn't interpolate the $), whereas the string "(a$)" does. The variable lex_inwhat keeps track of the current interpolation rules, so that the lexer can adapt for the differences between regexes and strings.

I32 lex_casemods
char *lex_casestack

The lexer uses lex_casemods and lex_casestack to keep track of how case modifiers are nested. (In this context, the phrase 'case modifiers' includes \Q, even though it doesn't actually work with letter cases.) For example, "\Qhi.\Ubye." evaluates to 'hi\.BYE\.'. Without a stack like lex_casestack, if the lexer saw the \U it would be unable to remember the \Q still in progress, so the effect of the \Q would stop, and the second dot wouldn't get backslashed.
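
A minimal sketch of that bookkeeping, using the variable names above (the real lexer tracks more detail than this):

    /* Minimal sketch of the case-modifier stack; the real
     * bookkeeping in toke.c is more involved. */
    char lex_casestack[128];
    I32  lex_casemods = 0;

    /* on \Q: */  lex_casestack[lex_casemods++] = 'Q';
    /* on \U: */  lex_casestack[lex_casemods++] = 'U';

    /* on \E: pop the innermost modifier; \Q stays in effect */
    if (lex_casemods > 0)
        lex_casemods--;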

I32 nexttoke
I32 nexttype[]
YYSTYPE nextval[]
U32 lex_defer

Sometimes the lexer decides that it knows not only what the next token is, but also what the token after that is, and sometimes even the token after that. For example, when it sees foo Bar, it may decide that foo is a method name and Bar is a package name. However, yylex can't return more than one token at a time.

To simulate returning multiple tokens, the lexer keeps pending tokens in these variables. It then sets lex_state (discussed above) to LEX_KNOWNEXT to remind itself that it already knows what the next token is supposed to be, and lex_defer to the value that lex_state would otherwise have had. Subsequent calls to yylex return the saved tokens, one at a time, until they are all used up, at which time lex_state is reset to the value of lex_defer and normal lexing resumes.
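
toke.c has a helper in exactly this spirit, force_next; the version below is a simplified sketch of the idea rather than the real thing:

    /* Simplified sketch of queueing a token for later; the
     * real force_next in toke.c differs in detail. */
    static void force_next(I32 type)
    {
        nextval[nexttoke] = yylval;    /* save any companion data */
        nexttype[nexttoke] = type;
        nexttoke++;
        if (lex_state != LEX_KNOWNEXT) {
            lex_defer = lex_state;     /* remember where we were   */
            lex_state = LEX_KNOWNEXT;  /* serve queued tokens next */
        }
    }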

As you can see, lexical state is a complicated assortment of various data. This is a reflection of the lexical complexity of Perl itself. Remember, Larry wrote Perl to be like a natural language, and if you think Perl is lexically complex, just think about the translation quality of your last VCR manual.

DECISIONS, DECISIONS

There are many little choices that were made when Perl's tokenization was designed. In this section, I'll describe a few of those decisions.

Pseudo-EOF. To support consistent behavior across various systems, a ^D or ^Z (ASCII code 4 or 26) is considered just as good as a real end of file.

NUL. NUL bytes (ASCII code zero) are ignored outside of string constants. However, yylex keeps the current line in linestr, which is an SV; and all SVs keep a NUL byte after the end of the strings they hold. So Perl's lexer takes advantage of that: it uses the NUL after the end of linestr to trigger behavior that should happen when the line buffer is empty.

One such behavior is the introduction of fake source code. Now, don't panic - it's perfectly normal. Consider the -n command line flag, which effectively wraps your entire Perl program in a while (<>) loop. You might expect that what really happens is a lot more complex than that, but it's not. It turns out that the code for the while (<>) is introduced in the first call to yylex, by the code that notices a NUL and figures out that the line buffer is empty. So the NUL-handling code does its job by providing more code to tokenize - not necessarily code from your program, but code nonetheless. So the -n flag literally does wrap your program in a while loop.

Barring such tricks, though, the code that handles NUL refills the line buffer by reading a line from rsfp, which is open on the current Perl program during its compilation. (Perl saves and restores rsfp, and most of the other variables we discussed here, when processing require or use statements.)
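
Put together, the NUL-handling path looks roughly like this sketch, where refill_line is a hypothetical stand-in for the real, much longer refilling code:

    /* Sketch of the sentinel trick; refill_line() is a
     * hypothetical stand-in for the real code. */
    switch (*bufptr) {
    case 0:                       /* the NUL past the end of linestr */
        if (!refill_line(rsfp))   /* read more source, maybe fake it */
            return 0;             /* genuine end of input */
        break;                    /* then tokenize the new line */
    /* ... every other character dispatches to a token handler ... */
    }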

And if perchance the line that is read is the first line of a Perl program, then here in yylex is the code that interprets the "shebang" line (the line starting with #!). If you've ever wondered why Perl is able to read multiple options from a shebang line (e.g. #!perl -w -l) even when your operating system supports only one option on a shebang line, the answer lies here in yylex.

And now you know why perl 5.004 has the new warning Too late for -T option. Taint mode isn't effective unless it's done from the very start, i.e. from the real command line. By the time yylex is called, a lot of taint-related things should already have happened, so perl can't guarantee that your data is safe. To avoid this problem, always make -T the first option on the shebang line.

Whitespace. All whitespace is ignored outside of string constants (although it can separate tokens that might otherwise run together). But yylex keeps track of newlines so it can maintain copline.

Dashes. A dash can be either a unary or a binary operator. Of course, that's not unusual in Perl (nor in other languages, for that matter). But there is something unusual about the dash in Perl: It's the first character of a file test operator.

Perl's lexer makes a special case for a dash followed by a single letter: It returns it as the token UNIOP - that is, a unary operator. (A code identifying which unary operator it is goes into yylval.) So, despite its odd appearance, a file test operator like -f follows the same grammatical rules as other unary operators like chr and int.
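
A rough approximation of the special case follows; the real code examines considerably more context, and op_for_filetest is a hypothetical helper:

    /* Rough approximation; op_for_filetest() is hypothetical,
     * and the real code checks more context than this. */
    if (*bufptr == '-' && isALPHA(bufptr[1]) && !isALNUM(bufptr[2])) {
        yylval.ival = op_for_filetest(bufptr[1]);  /* which test: f, e, r... */
        return UNIOP;
    }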

Equal signs. An equal sign can be the first character of ==, =~, or =>. But if it immediately follows a newline (i.e. is at the left margin), then it may also be the introduction to a piece of Plain Old Documentation (pod) embedded in the program. The job of yylex includes skipping over embedded documentation and getting on with tokenizing the rest of the program.
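
Sketched out, with a hypothetical helper doing the line matching, the pod skipper looks something like this:

    /* Illustrative pod skipper; starts_with() is hypothetical.
     * '=' at the left margin followed by a letter starts pod. */
    if (*bufptr == '=' && bufptr == SvPVX(linestr) && isALPHA(bufptr[1])) {
        do {
            /* read the next line of the program into linestr */
        } while (!starts_with(linestr, "=cut"));   /* hypothetical */
    }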

HAD ENOUGH?

No, I didn't think so!

For more information, you might search out:

toke.c. The source of all evil - er, tokens. There is a lot more evil lurking here in the heart of Perl. Feel free to browse, yet also feel free to skip the parts that are completely incomprehensible; you wouldn't want to suffer a conniption in front of your coworkers.

perly.y. Perl's grammar. This is the definition of the larger structure of the Perl language. Here is where you find out just how those tokens returned by yylex are allowed to be arranged in a valid Perl program. (And it's also where you find out just what's done with the additional information that the lexer puts in yylval.)

-Dp. If you want to see the various tokens returned by the lexer in real time, compile your perl binary with -DDEBUGGING and then run perl with the -Dp option. It will show you debugging output from the parser, including a description of the tokens it's working with and the grammatical rules that those tokens are triggering.

Warning: This generates a lot of output. But it can reveal to you where things are going wrong when you thought you had valid Perl code but your perl binary disagrees with you.

BUT WAS IT ALL A DREAM?

"In our next episode: Chip Salzenberg has gone missing, kidnapped by the Brotherhood of the Horned Llama and dragged into the Caves of the Ancient Spitting Ones. Yet his voice can still be heard, muttering strange phrases like 'indirect objects' and 'class methods' and 'bearwords are barewords'. Is Chip lost forever? Are the echoes merely the memory of a brave but fairly stupid code diver? Or may he yet return, bearing tales of the Brotherhood's objects and methods? Tune in to find out!

__END__


Chip Salzenberg has been programming for almost twenty years. For most of those years he has promoted free software. He was coordinator and primary programmer for Perl 5.004. His major solo project was Deliver, the free email local delivery agent that allows flexible, safe, and reliable message handling. Chip's hobbies include patching Perl, tending to his six parrots, and memorizing Mystery Science Theater 3000 episodes. (Hikeeba!)