
Guts: Basic Anatomy

Chip Salzenberg

Welcome to the anatomy class. Please grab a smock and gloves and step up to the table.

In the previous (first!) installment of this column, I described the overall content and layout of the perl distribution. Now that the stage has been set, I'll focus on perl's implementation of the Perl language, with occasional excursions into the standard library.

Gross Anatomy

First, we'll survey the major portions of perl, and their functions. You'll notice that the word 'perl' isn't always capitalized in this column. That's because Larry Wall draws a distinction between 'Perl' the language and 'perl' the program. In theory, there could be other programs besides 'perl' that implement Perl. Thus, this column is not about 'the guts of Perl', but 'the guts of perl'.

The Story of a Perl Program. How does perl execute your Perl program? Here's a bird's-eye view.

  1. It reads your program and breaks it down into basic syntactic elements, called tokens. (Lexical Analysis)

  2. It figures out the meaning of those tokens in that specific sequence. (Parsing)

  3. It builds an internal structure representing operations that, when executed, will perform the actions specified by your program. (Compilation)

  4. It steps through those operations and performs them, one by one. (Execution)

Note that perl distinguishes step three from step four. The line is blurred by BEGIN {} blocks, which specify Perl code to be executed during the compilation phase - before your program runs.

Step 3 settles one of the frequently asked questions about Perl: perl is a compiler, not an interpreter. Specifically, it's a "load-and-go compiler." Incidentally, current plans for perl 5.005 include a module that produces C code directly from a Perl program; that would have been nearly impossible were perl not a compiler.

We'll examine these four steps in detail now.

1. Lexical Analysis

This subsystem takes the input stream (your Perl program) and interprets it as a sequence of tokens. Tokens are the basic elements of program syntax in any language. Perl's tokenizer is in the file toke.c.

For example, given the program

   print "Hello, world\n"; 

lexical analysis produces tokens that represent the operator print, the string "Hello, world\n", and the final semicolon.

Tokens vary from computer language to computer language, and even from implementation to implementation for a given language. What constitutes a token varies depending on how much complexity or regularity should be visible in the grammar. For example, given a Perl scalar $i, should the dollar sign and the letter be two tokens, or just one? The former makes the grammar more complicated and the lexical analysis simpler; the latter does the opposite. Neither is really wrong in itself; it depends on other factors, including history and consistency with existing code.
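To make that tradeoff concrete, here is a toy hand-written scanner in C (all names invented; this is not code from toke.c) that takes the "one token" approach: the dollar sign and the identifier after it are grouped into a single variable token, so the grammar never has to know they were ever separate characters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token kinds for a toy lexer; hypothetical names, not perl's. */
enum tok { T_EOF, T_VAR, T_NUM, T_PUNCT };

/* Scan one token from *sp, advancing the pointer.
   A '$' and the following identifier become ONE T_VAR token -
   the "simpler grammar" choice discussed in the text. */
static enum tok next_token(const char **sp, char *text)
{
    const char *s = *sp;
    while (isspace((unsigned char)*s)) s++;
    if (*s == '\0') { *sp = s; return T_EOF; }
    if (*s == '$') {
        char *t = text;
        *t++ = *s++;                       /* keep the sigil... */
        while (isalnum((unsigned char)*s) || *s == '_')
            *t++ = *s++;                   /* ...and the name, together */
        *t = '\0'; *sp = s; return T_VAR;
    }
    if (isdigit((unsigned char)*s)) {
        char *t = text;
        while (isdigit((unsigned char)*s)) *t++ = *s++;
        *t = '\0'; *sp = s; return T_NUM;
    }
    text[0] = *s; text[1] = '\0'; *sp = s + 1;
    return T_PUNCT;                        /* toy: everything else */
}
```

The alternative ("two tokens") would return the `$` and the `i` separately and let the grammar reassemble them, at the cost of extra grammar rules.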

As you might imagine if you have any experience with writing compilers, lexical analysis of Perl is a complex and exacting task, full of special cases and best guesses.

Consider the expression new Foo. If the subroutine &new has been defined or declared when new Foo is encountered, this is interpreted as a function call - as if it had been spelled new(Foo). If there is no subroutine &new defined or declared, and if a package Foo has been defined through use, then the expression is taken as a method call, as if it had been spelled Foo->new. But if there is no package Foo, then the expression is taken as a function call after all. This is just a small example of how Perl's lexical analysis is aware of, well, just about everything.

There's enough interesting complexity in Perl's lexical analysis to fill several columns, so I won't go into detail here about how it works. Perhaps it will be enough for now to explain that the main entry point for perl's lexical analysis is a function called yylex(). (If you're familiar with the lex or flex utilities, you might assume yylex() is automatically generated by one of them. It's not. yylex() is hand-crafted specifically for the special cases and best guesses that make Perl so useful.) The responsibility of yylex() is, whenever it's called, to return the next token from the script currently being compiled. Because each token is represented internally as a single integer, any additional data associated with a token (the value of a constant, for instance) must travel separately; yylex() puts this data into a global structure named yylval before returning the token number.
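That calling protocol can be sketched in miniature. This C version uses invented names (perl's real yylex() in toke.c is vastly hairier), but the shape is the same: the function returns a small integer naming the token and stashes the token's payload in a global before returning.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical miniature of the yylex()/yylval protocol. */
enum { TOK_EOF = 0, TOK_NUMBER = 257, TOK_PLUS = 258 };

union { long ival; } my_yylval;   /* stand-in for perl's yylval */
static const char *my_input;      /* the "script" being scanned  */

int my_yylex(void)
{
    while (isspace((unsigned char)*my_input)) my_input++;
    if (*my_input == '\0') return TOK_EOF;
    if (isdigit((unsigned char)*my_input)) {
        char *end;
        my_yylval.ival = strtol(my_input, &end, 10); /* payload */
        my_input = end;
        return TOK_NUMBER;                           /* token number */
    }
    my_input++;
    return TOK_PLUS;   /* toy: anything else is '+' */
}
```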

2. Parsing

The tokens produced by lexical analysis don't themselves constitute a program. Even if perl can enumerate the tokens in your program, it can't tell what your program means until it understands those tokens in a valid order and context. Perl's grammar is a description of all valid arrangements of tokens, and what those arrangements mean.

PerlGuts Illustrated
Gisle Aas’ tutorial-in-progress (with pictures!) about Perl’s guts is available on the CPAN and at http://home.sol.no/~aas/perl/guts/

It is possible to implement a grammar in a general-purpose language like C. But that's unnecessarily difficult and error-prone. Specialized tools for writing grammars and translating them to executable code have existed for decades; one such tool is yacc. The name is an acronym for "Yet Another Compiler Compiler."

yacc grammars contain a set of grammatical rules, and actions (C code) to be taken when a rule is triggered. It generates C code that, when executed, "accepts" that grammar and performs the specified actions. This may seem a bit vague and magical, but it's really simple in concept: yacc writes C code for a function named yyparse(). When called, yyparse() repeatedly calls yylex() and does whatever the tokens demand.
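yacc's generated parser is table-driven and far more general, but the calling pattern it sets up can be imitated by hand. In this toy C sketch (all names invented), a parse function plays the role of yyparse(): it repeatedly asks a lexer for tokens and runs an "action" when its one grammar rule completes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

static const char *src;   /* input being parsed */

/* Toy lexer: yields the next number, skipping '+' and spaces.
   Returns 0 at end of input. */
static int lex_number(long *out)
{
    while (isspace((unsigned char)*src) || *src == '+') src++;
    if (*src == '\0') return 0;
    char *end;
    *out = strtol(src, &end, 10);
    src = end;
    return 1;
}

/* One grammar rule, hand-coded:
     expr -> NUMBER ('+' NUMBER)*   { $$ = sum of the numbers }
   The body of the loop is the rule's "action". */
long parse_sum(const char *text)
{
    long total = 0, n;
    src = text;
    while (lex_number(&n))
        total += n;      /* the action fires per matched NUMBER */
    return total;
}
```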

The grammar of perl is designed for input to yacc. (Actually, that's not quite true. The files perly.c and perly.h are not actually created with yacc, but rather with byacc (also known as "Berkeley yacc"), a clone of yacc originally written by Robert Corbett. To be precise, perl uses byacc 1.8.) It is distributed in the file perly.y. You might expect that the perl build procedure includes running yacc on this file. However, to save Perl builders the trouble, and because yacc isn't available everywhere, the distribution of perl includes perly.c and perly.h, created from perly.y by the creator of the perl distribution.

3. Compilation

The purpose of the yacc grammar is to build trees of structures called OPs. Each OP represents a low-level operation to be performed at runtime. Its primary field is op_code, an integer that represents the action to be performed when execution reaches the given OP. Valid opcodes are enumerated in opcode.h, which is automatically generated from opcode.pl before distribution. One OP, alone and afraid, isn't much good to anyone. So OPs are built into trees, which are then either stored for later use (as subroutines) or executed immediately.
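Here's the flavor of an OP tree in miniature C. The field and opcode names are invented; the real struct op in op.h carries many more fields, and perl doesn't execute by recursive tree-walking (it threads the OPs into an execution order and steps through them), but the idea of "an integer opcode plus child operations" is the same.

```c
#include <assert.h>

/* A drastically simplified picture of an OP. Invented names. */
enum opcode { OP_CONST, OP_ADD, OP_MULT };

typedef struct op {
    enum opcode op_code;      /* which action to perform        */
    struct op *left, *right;  /* child operations, if any       */
    long value;               /* payload, used only by OP_CONST */
} OP;

/* Evaluate a tree by walking it - a sketch only; see caveat above. */
long run(const OP *o)
{
    switch (o->op_code) {
    case OP_CONST: return o->value;
    case OP_ADD:   return run(o->left) + run(o->right);
    case OP_MULT:  return run(o->left) * run(o->right);
    }
    return 0;
}
```

A tree for the expression `2 * 3 + 7` would have an OP_ADD at the root, with an OP_MULT subtree on its left and an OP_CONST on its right.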

There is a lot of low-level manipulation of OP structures that occurs during compilation. Not much of it would make sense if I described it out of context. Suffice it to say there's a fair amount of code, particularly in op.h and op.c, to define OPs, create them, and build and optimize trees of them.

4. Execution

With this subsystem, we cross the Rubicon from preparing to doing. Until now we've been getting ready to run. Now we run.

The execution stage is the part of perl that users can imagine most easily. For each opcode OP_FOO, there is a function pp_foo() that is called to execute the given OP. For example, pp_add() adds two numbers; pp_chdir() changes the current directory, and so on.
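The opcode-to-function mapping can be pictured as a table of function pointers indexed by opcode. This C sketch uses invented names and passes operands directly, where the real pp_* functions take their arguments from perl's stack, but the dispatch mechanism is the recognizable part.

```c
#include <assert.h>

/* Invented opcodes; perl's real list lives in opcode.h. */
enum { MY_OP_ADD, MY_OP_SUB, MY_OP_MAX };

/* One pp_-style function per opcode. */
static long pp_add(long a, long b) { return a + b; }
static long pp_sub(long a, long b) { return a - b; }

/* The dispatch table: index by opcode, get the function to call. */
static long (*const pp_table[MY_OP_MAX])(long, long) = {
    [MY_OP_ADD] = pp_add,
    [MY_OP_SUB] = pp_sub,
};

/* "Execution" of one OP is just a table lookup and a call. */
long execute(int op_code, long a, long b)
{
    return pp_table[op_code](a, b);
}
```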

Many opcodes are relatively simple to execute. Some, however, are quite complex, or have significant portability issues, or both. So there is a lot of code in this subsystem, more than you might expect from a surface examination of perl's list of opcodes.

Is That All There Is To It?

There's more to perl than these four stages. It's a large program, and as such has many widely used services that aren't tied firmly to any particular stage. For example, exception handling involves all four stages.

To save space and time, code that implements such idioms or provides such services is typically collected into what would be a library if it were compiled and installed separately. When it's not installed separately, it's sometimes called a 'subsystem.' We'll look at some of those subsystems now.

Internal Flow Control and Exception Handling. Perl's semantics include exception handling. As perl runs, it may at almost any time interrupt the flow of execution and unwind the call stack back to a previously established checkpoint. This behavior is visible from Perl in the form of the eval() and die() operators, but it can also be triggered by perl's internals, such as by the detection of internal errors or a user's attempts to perform illegal operations.

This subsystem implements Perl's exception-handling semantics. Nonlocal transfer of control is the (relatively) easy part of exception handling: perl uses the standard C functions setjmp() and longjmp() for that purpose. The (relatively) hard part is freeing allocated memory and performing other cleanup operations, which perl monitors with a stack of actions to be performed when leaving each dynamic scope. If that doesn't make a lot of sense to you, relax. Or breathe deeply and dive into scope.h and scope.c. Or both.
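Here is a bare-bones C sketch of that combination (all names invented; scope.c's machinery is much richer): a checkpoint established with setjmp(), and a die that first runs a stack of pending cleanup actions and then longjmp()s back to the checkpoint.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf catch_point;        /* the established checkpoint   */
static int cleanups_run;           /* proof the cleanups happened  */

/* A tiny stack of "actions to perform when leaving a scope". */
static void (*cleanup_stack[8])(void);
static int cleanup_top;

static void push_cleanup(void (*fn)(void))
{
    cleanup_stack[cleanup_top++] = fn;
}

/* Unwind: run pending cleanups, then jump to the checkpoint. */
static void my_die(void)
{
    while (cleanup_top > 0)
        cleanup_stack[--cleanup_top]();
    longjmp(catch_point, 1);
}

static void free_buffer(void) { cleanups_run++; }  /* stand-in cleanup */

/* Like a tiny eval {}: returns 1 on success, 0 if body "died". */
int protected_call(void (*body)(void))
{
    if (setjmp(catch_point))       /* longjmp() lands here */
        return 0;
    body();
    return 1;
}

static void risky(void)
{
    push_cleanup(free_buffer);     /* register cleanup for this scope */
    my_die();                      /* something went wrong            */
}
```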

User-Visible Data Structures. Perl's user-visible data structures are scalars (including references and type globs), arrays, hashes, and subroutines. This subsystem implements them. It's fairly straightforward, as perl goes. Its major complication is "magic", a catch-all feature used to implement tied variables, special variables (like $!, $., and $1), operator overloading, and a few other miscellaneous features.
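As a rough C analogy for magic (invented names; perl's real implementation attaches a linked list of magic structures to a variable), imagine a value whose every fetch and store can be routed through optional hooks:

```c
#include <assert.h>

/* A value with optional per-fetch and per-store hooks. */
typedef struct mvar {
    long value;
    long (*get_hook)(struct mvar *);        /* NULL = no magic */
    void (*set_hook)(struct mvar *, long);  /* NULL = no magic */
} MVar;

long mvar_fetch(MVar *v)
{
    return v->get_hook ? v->get_hook(v) : v->value;
}

void mvar_store(MVar *v, long n)
{
    if (v->set_hook) v->set_hook(v, n);
    else v->value = n;
}

/* Example magic: a value that, like $., is computed on every read. */
static long computed_on_read(MVar *v) { (void)v; return 42; }
```

Tied variables, the special variables, and overloading all fit this pattern: the code that fetches or stores a value consults the hooks instead of touching the raw storage.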

Regular Expressions. Perl's regular expressions are one of its cardinal features. The subsystem that implements them for perl is deep magic. I mean deep magic. Here there be dragons.

That's not to say that perl's regex code is impossible to understand. However, it is written for speed, not clarity. There are so many special cases that the general flow is difficult to untangle, which makes grasping the fullness of the regex code a real challenge. I'll get back to this subsystem in a future column. In the meantime, treat it as a black box.

Input/Output. Perl's input/output model is quite simple in concept. Most of the complication arises from variations in configuration. While perl has historically used C's stdio interface directly, recent versions of perl have been designed to use the sfio ('Safe/Fast I/O', an optional extension) library instead.

An intermediate layer, the 'PerlIO' abstraction, lives in between perl and whatever I/O library you're using. It provides the point of control for perl's configuration procedure. In theory, PerlIO should allow you to host perl on any adequately rich I/O library.
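The shape of such a layer can be sketched in C: the core calls only through a small table of function pointers, so any backend that fills in the table (stdio, sfio, or something else entirely) can be slotted in underneath. All names here are invented, not PerlIO's actual API.

```c
#include <assert.h>
#include <string.h>

/* The abstraction: a table of I/O operations (just one, here). */
typedef struct iolayer {
    int (*write)(const char *buf, int len);
} IOLayer;

/* One backend: "write" into a memory buffer. */
static char captured[64];
static int captured_len;

static int mem_write(const char *buf, int len)
{
    memcpy(captured + captured_len, buf, (size_t)len);
    captured_len += len;
    return len;
}

static const IOLayer mem_layer = { mem_write };

/* The core only ever sees the abstract layer, never the backend. */
int io_print(const IOLayer *io, const char *s)
{
    return io->write(s, (int)strlen(s));
}
```

Swapping I/O libraries then means supplying a different table, with no change to the calling code.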

Standard Classes. In present releases of perl, this subsystem is tiny, consisting entirely of just one class: UNIVERSAL. Run the command perldoc UNIVERSAL for details of its services.

Support for Extensions. This subsystem consists of support for user-written extensions - modules that include not only Perl code but also executable binary code, typically written in C for speed. This subsystem is also tiny; by my accounting, it consists entirely of a single header file with some convenient macros. The majority of perl's support for extensions isn't in the core, but in the build process, library, and external utilities.

Portability and Configuration. Since perl is one of the world's most portable large programs, there is a lot of bulk to its configuration mechanism. It's so extensive that Larry Wall created a tool, called dist, just to automate some of the more tedious work of maintaining and extending it.

From the point of view of a user installing perl in a reasonably Unix-like environment, configuration starts with the execution of a massive shell script named Configure. (Configuration procedures in non-Unix environments will vary widely; see the appropriate README for information.)

Configure has two primary jobs: Examining the system to figure out what facilities are available for perl to use; and asking the person building perl to specify his preferences for the build procedure and runtime options. The primary product of Configure is a configuration file named config.sh that has the form of a Bourne shell script. Here are selected lines from perl 5.004_04, configured for my Linux machine:

archlib='/usr/lib/perl5/i486-linux/5.004' 
bin='/usr/bin' 
byteorder='1234' 
cc='gcc' 
ccflags='-g -pipe -Dbool=char -DHAS_BOOL -DFIXNEGATIVEZERO -DDEBUGGING'
cf_email='chip@pobox.com' 
d_sigaction='define' 
extensions='Fcntl GDBM_File IO Opcode POSIX SDBM_File Socket' 
lddlflags='-shared -L/usr/local/lib' 
ldflags='-g -L/usr/local/lib' 
libs='-lm -lgdbm -ldl' 
privlib='/usr/lib/perl5'

As you can see, config.sh contains a mixture of information about hardware ($byteorder), operating system, compiler, and library ($cc, $d_sigaction), and user preferences ($bin, $privlib).

After Configure creates config.sh, it calls the various shell scripts in the perl source tree that have the suffix .SH. Each .SH script creates a configured output file based on the information in config.sh: Makefile.SH creates Makefile, config_h.SH creates config.h, and so on. Also, during the build process, the Perl module Config.pm is created to provide a simple Perl interface to config.sh. All these files based on config.sh are how perl configures its build process and behavior.

lex and yacc
lex is a program that generates C programs. The programs it generates are lexical analyzers, which group a stream of characters into meaningful tokens. yacc is a program that generates C code. The code generated is called a parser, which executes actions whenever particular patterns of tokens are found. When you use yacc, you provide the patterns (a grammar) and a list of actions.

lex and yacc are often used together. The GNU equivalents are called flex and bison, respectively. byacc is a Berkeley version of yacc - that's what Perl uses.

lex isn’t as useful as yacc. It's often a good idea to bypass lex and write your own lexical analyzer; it’s almost never a good idea to emulate yacc's behavior on your own.

Next Issue

Next issue, I'll go into more ugly (and yes, I do mean ugly) detail about perl's lexical analysis.

__END__


Chip Salzenberg has been programming for almost twenty years. For most of those years he has promoted free software. He was coordinator and primary programmer for Perl 5.004. His major solo project was Deliver, the free email local delivery agent that allows flexible, safe, and reliable message handling. Chip's hobbies include patching Perl, tending to his six parrots, and memorizing Mystery Science Theater 3000 episodes. (Hikeeba!)
