
Taking Perl to the Java Virtual Machine

Brian Jepson

Packages Used

JPL               bundled in Perl developer releases since 5.005_54
B::JVM::Jasmin    CPAN
Jasmin            http://mrl.nyu.edu/meyer/jasmin/

Since the release of JPL (a collection of libraries and modules for linking Perl and Java), there has been a great deal of interest in finding some method of converting Perl source code into Java class files. And why not? At least in theory, it should be possible to take advantage of all the work that's gone into Malcolm Beattie's Perl compiler to write a back end that emits a Java class file.

Because I've done a little bit of hacking on JPL, I've been interested in this myself. So, about a year ago, I set out to learn something about compilers. In a move that met this goal (and had the added side effect of increasing the average student age), I enrolled in a compiler design class at the University of Rhode Island. The decision to take the class was made easier by the fact that the class project involved writing a compiler for the Java virtual machine.

In the summer of 1999, I decided to apply what was left of my knowledge from that class. Determined to write something (anything!) that could take Perl source in one end and spit out a Java class file on the other end, I wrote a simple module that handles a subset of Perl. It's not something you could get a lot of use out of yet, but it is a start.

Let's take a look at some of the things I learned in class and while writing B::JVM::Jasmin. As we do this, we'll learn a few things about the mysterious Perl Compiler as well.

The Basics

Anyone who wants to write a program that generates a Java class file is going to run into the same two problems: parsing the source language, and generating the class file format itself. The freely available Jasmin assembler makes the second problem go away, and the fact that we are working with the Perl Compiler makes the first go away.

What's a Virtual Machine?

A virtual machine is a computer that you can't break by sticking a screwdriver between the motherboard and CPU. That's because a virtual machine is a conceptual entity. Like a computer's CPU, a virtual machine can take a list of instructions and execute them in sequence. Virtual machines generally exist first as a specification, and then as implementations.

The Java virtual machine uses a stack for many elementary operations. For example, if you want to multiply two numbers, you push those two numbers on the stack, and then execute a multiplication instruction, such as imul. This multiplication instruction removes the top two numbers from the stack, multiplies them together, and puts the result back on the stack.

Let's consider a simple virtual machine that has a stack and understands four instructions:

Instruction     Action
push NUMBER     Push the number onto the stack.
add             Remove the top two numbers from the stack, add them,
                and push the result back on the stack.
mul             Remove the top two numbers from the stack, multiply them,
                and push the result back on the stack.
print           Remove and print the number on the top of the stack.

How would this virtual machine handle a simple task, such as printing out the result of 2 * 3? The task would be broken down into a series of machine instructions:

push 2    ; push 2 onto the stack
push 3    ; push 3 onto the stack
mul       ; multiply the top two stack items
print     ; display the result

Figure 1 shows how the stack responds to the first three instructions.

Now that we've seen how this virtual machine should work, let's look at a simple implementation in Perl:

my @stack = ();

# Read all the lines that follow __DATA__ in this file.
#
while (<DATA>) {

    # Read the instruction and argument (if any).
    #
    my ($instruction, $arg) = split(/\s+/, $_);

    # Process one of the four instructions:

    if ($instruction eq 'push') {

        push @stack, $arg;

    } elsif ($instruction eq 'mul') {

        my $a = pop @stack;
        my $b = pop @stack;
        push @stack, $a * $b;

    } elsif ($instruction eq 'add') {

        my $a = pop @stack;
        my $b = pop @stack;
        push @stack, $a + $b;

    } elsif ($instruction eq 'print') {

        my $a = pop @stack;
        print "$a\n";

    }

}

__DATA__
push 2
push 3
mul
print

This may look simple now, but things can get a little tricky. Think about how you might evaluate print 2 + 3 * 4;. Why does the following code produce the incorrect answer?

 
push 2
push 3
add
push 4
mul
print
 
 

How would you correct it in a systematic fashion that could be applied to other expressions? We'll see the answer shortly, when we consider how a parse tree works.

What's In a Class File?

A class file is a little more complicated than you might expect. Figure 2 shows a summary of all the different parts of a class file. There is quite a bit of stuff to keep track of, such as:

Magic Number - The first four bytes of the file spell out the hexadecimal number 0xCAFEBABE. This lets programs such as /usr/bin/file figure out what type of file they're dealing with.

Major and Minor Version - These identify the version of the virtual machine the class file was compiled for. Together they work out to 45.3, and probably haven't changed since the first release of the JDK. While the Java language and supporting class libraries have evolved quite a bit, the Java Virtual Machine itself has been relatively stable.

Constant Pool Count - The size of the constant pool.

Constant Pool - All of the constants belonging to the class.

Access Flags - Flags that specify whether the class is publicly visible, whether it is an abstract class, and other class attributes.

This Class - An index to the location of the class name in the constant pool.

Superclass - An index to the location of the superclass name in the constant pool.

Interface Count - The number of interfaces implemented by this class.

Interfaces - All of the interfaces this class implements.

Field Count - The number of fields this class has. Fields are also known as properties or data members.

Fields - All the fields of this class.

Methods Count - The number of methods of this class.

Methods - All the methods of this class.

Attributes Count - The number of attributes this class has. These should not be confused with fields; attributes store extended class information, such as the location of the original source file.

Attributes - All the attributes of this class.

Fortunately, you don't need to worry about the structure of a class file if you want to generate your own class files. Jasmin is capable of reading assembler, and generates a class file for you.
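
Still, it's easy to peek at the fixed-size part of the header yourself. The first eight bytes are the magic number followed by the minor and major version numbers, so a throwaway Perl sketch like the following (nothing you need for Jasmin; it's here only to make the header concrete) can confirm the 0xCAFEBABE and 45.3 claims above:

# Read the first eight bytes of a class file: a four-byte magic number,
# then two-byte minor and major version numbers, all big-endian.
my $file = shift @ARGV or die "usage: $0 some.class\n";
open CLASS, $file or die "can't open $file: $!";
binmode CLASS;
read CLASS, my $header, 8 or die "couldn't read the header: $!";
my ($magic, $minor, $major) = unpack "N n n", $header;
printf "magic 0x%08X, version %d.%d\n", $magic, $major, $minor;
close CLASS;

Run against the sample.class we build later in the article, it should print magic 0xCAFEBABE, version 45.3, agreeing with what /usr/bin/file reports.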

Bytecode as Machine Language

The parts of the class file that are most important to us are the machine instructions that make your programs go. In the example we saw earlier, these machine instructions took the form of names (push, add, mul, print), used the same way assembler code is used. Assembler code is made up of human-readable mnemonics; the idea is that it's easier to remember that imul multiplies numbers than to remember that 0x68 does the same thing.

An assembler is a tool that converts these mnemonics into machine instructions, or opcodes. For example, the JVM's imul (integer multiply) instruction is 0x68 in hexadecimal, but 0x68 isn't the hexadecimal value of the text string "imul". The only relationship between "imul" and 0x68 is that, by convention, you and the assembler have agreed that "imul" stands for 0x68. In fact, because there is no standard Java assembler syntax, another assembler might choose to represent 0x68 as INT_MULT or some such thing.

How is an assembler different from a compiler? A compiler interprets the meaning of some source code, while an assembler makes a direct translation from mnemonic assembler code to machine code.

For example, a compiler knows that 2 + 3 * 4 means that the 3 and the 4 should be multiplied before the 2 is added. An assembler knows no such thing about the code. It assumes that statements will be executed in sequence. We'll look at some of the things that a compiler does shortly.

Jasmin: an Assembler for Java

Jasmin is an assembler for Java that takes a lot of the hassle out of developing classes in Java machine code. For the most part, all you have to worry about is writing the assembler code. To generate a class, feed your assembler source into Jasmin, and then run java on the resulting class.

Here's a short listing of Jasmin assembler for the first assembler program we looked at earlier:

   1  .class public sample
   2  .super java/lang/Object

   3  .method public static main([Ljava/lang/String;)V
   4      .limit stack 3
   5      .limit locals 0

   6      getstatic java/lang/System/out Ljava/io/PrintStream;
  
   7      sipush 2
   8      sipush 3
   9      imul
   
  10      invokestatic java/lang/String/valueOf(I)Ljava/lang/String;
  11      invokevirtual java/io/PrintStream/println(Ljava/lang/String;)V

  12      return
  13  .end method

Lines 1-2 define the name of the class and its immediate superclass.

Lines 3-5 define the public static void main() method, setting up the size of its stack (3) and the number of local variables (0). From here to line 13, all the code is contained within this method, which is automatically called if you load the class with the Java interpreter.

Line 6 pushes the object java.lang.System.out on the stack. This is Java's equivalent of STDOUT.

Lines 7-9 are the Java equivalent of:

push 2    ; push 2 onto the stack
push 3    ; push 3 onto the stack
mul       ; multiply the top two stack items

which we saw earlier as our sample input to a mythical virtual machine.

Lines 10-11 convert the integer on the stack to a string, and invoke the println() method of the object we pushed in line 6.

Lines 12-13 return from the method and end it.
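
Looked at from the other direction, this listing is all that a code generator targeting Jasmin has to produce: plain text. A back end could emit the entire program above with little more than a here-document; the fragment below is purely illustrative and is not how B::JVM::Jasmin is actually structured:

# Write the Jasmin source for the 2 * 3 example to sample.asm.
open ASM, "> sample.asm" or die "can't write sample.asm: $!";
print ASM <<'END_OF_ASM';
.class public sample
.super java/lang/Object

.method public static main([Ljava/lang/String;)V
    .limit stack 3
    .limit locals 0

    getstatic java/lang/System/out Ljava/io/PrintStream;

    sipush 2
    sipush 3
    imul

    invokestatic java/lang/String/valueOf(I)Ljava/lang/String;
    invokevirtual java/io/PrintStream/println(Ljava/lang/String;)V

    return
.end method
END_OF_ASM
close ASM;

A real back end builds this text up piece by piece as it walks the program's structure, which is what we'll watch B::JVM::Jasmin do later on.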

Using Jasmin

To assemble this program, simply run the command:

jasmin sample.asm

This creates the file sample.class:

bash-2.01$ ls -l sample.class
-rw-r--r--   1 bjepson  bjepson  371 Oct 12 12:56 sample.class
bash-2.01$ file sample.class
sample.class: compiled Java class data, version 45.3

You can run the class by feeding it into the Java interpreter:

bash-2.01$ java sample
6

Jasmin and its documentation are included with Jon Meyer and Troy Downing's Java Virtual Machine, published by O'Reilly & Associates. Jasmin is also available from http://mrl.nyu.edu/meyer/jasmin/.

Growing a Tree

Let's return to the question I posed earlier. Is there a systematic way to correctly handle a case like print 2 + 3 * 4;? Upon inspection, it's easy to see that the correct answer is:

push 3
push 4
mul
push 2
add
print

But how do you know that? How would you handle a more complicated case, such as print 2 + 3 * 4 * (6 + 3 * 2);? I'm sure you could work out the answer on paper, but we need a technique that can get this right automatically, without human intervention.

This is where the front end comes in. The front end of the compiler parses the source code and passes a representation of that source code to a back end.

From Code to Tree

How the parser's rules (defined below) are applied to print 2 + 3 * 4:

    Rule               Result
    PRINT-STATEMENT    PRINT EXPRESSION SEMICOLON
    EXPRESSION(1)      PRINT INTEGER OPERATOR EXPRESSION
    EXPRESSION(1)      PRINT INTEGER OPERATOR INTEGER OPERATOR EXPRESSION
    EXPRESSION(3)      PRINT INTEGER OPERATOR INTEGER OPERATOR INTEGER
    Simplified         PRINT 2 + 3 * 4

There are three components of the front end that we need to know about: the lexer, the parser, and the semantic analyzer. Whether these are distinct modules within the Perl compiler is not important (and I don't know enough about the Perl compiler to tell you, anyway). What is important is that these components are included in most compilers.

In this section, we'll look at how you might implement something that can parse a command like print 2 + 3 * 4;. This is not meant to be an accurate portrayal of how Perl handles this, but it is meant to give you enough information to understand how a back end like B::JVM::Jasmin works.

This discussion is deliberately simplified. It ignores the fact that any decent front end would fold 2 + 3 * 4 into 14 before handing it off to the back end. There are a number of other simplifications, but these have been made to make it easy to explain the issues involved with a compiler.

The lexer is responsible for reading a stream of source code (such as a program file or something passed in through -e), and breaking it up into tokens. A token is often a language keyword (such as print), a string (such as "hello"), or a number (such as 2). As the lexer reads the source file, it sends a series of tokens to the parser. When the lexer sees a statement like print 2 + 3 * 4;, it tells the parser something like:

Here comes a statement:
Token: PRINT
Token: INTEGER(2)
Token: OPERATOR(+)
Token: INTEGER(3)
Token: OPERATOR(*)
Token: INTEGER(4)
Token: SEMICOLON
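
To make the lexer's job concrete, here is a tiny lexer sketched in Perl. It's a toy (perl's real lexer is a much hairier piece of C), and the lex_statement() name and token layout are invented for this example, but it produces exactly the stream of tokens shown above:

# A toy lexer: chop a statement into [TYPE, VALUE] tokens.
sub lex_statement {
    my ($source) = @_;
    my @tokens;
    while (length $source) {
        $source =~ s/^\s+//             and next;    # skip whitespace
        if    ($source =~ s/^print\b//) { push @tokens, ['PRINT']         }
        elsif ($source =~ s/^(\d+)//)   { push @tokens, ['INTEGER',  $1]  }
        elsif ($source =~ s/^([+*])//)  { push @tokens, ['OPERATOR', $1]  }
        elsif ($source =~ s/^;//)       { push @tokens, ['SEMICOLON']     }
        else  { die "lexer: can't make sense of '$source'\n" }
    }
    return @tokens;
}

# Prints: PRINT INTEGER(2) OPERATOR(+) INTEGER(3) OPERATOR(*) INTEGER(4) SEMICOLON
print join(" ", map { $_->[0] . (defined $_->[1] ? "($_->[1])" : "") }
                    lex_statement('print 2 + 3 * 4;')), "\n";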

The parser, in turn, matches this to some rules, such as:

EXPRESSION(1)   ::= INTEGER OPERATOR EXPRESSION
EXPRESSION(2)   ::= EXPRESSION OPERATOR INTEGER
EXPRESSION(3)   ::= INTEGER
PRINT-STATEMENT ::= PRINT EXPRESSION SEMICOLON
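
Rules like these map naturally onto a hand-written recursive parser, with one subroutine per kind of expression. Continuing the toy lexer above, here is one way to sketch it in Perl; the parse_* subroutines and the hash-based node layout are inventions of this example (real parsers are usually generated from a grammar), but the tree it builds has the shape we care about:

# A toy parser: consume the token list and build a tree of nested hashes.
# Multiplication binds tighter than addition, so parse_expression()
# defers to parse_term(), which in turn defers to parse_integer().
my @tokens;

sub parse_statement {
    @tokens = @_;
    shift @tokens;                                    # the PRINT token
    my $tree = { op => 'print', kids => [ parse_expression() ] };
    shift @tokens;                                    # the SEMICOLON token
    return $tree;
}

sub parse_expression {                                # handles +
    my $node = parse_term();
    while (@tokens and $tokens[0][0] eq 'OPERATOR' and $tokens[0][1] eq '+') {
        shift @tokens;
        $node = { op => 'add', kids => [ $node, parse_term() ] };
    }
    return $node;
}

sub parse_term {                                      # handles *
    my $node = parse_integer();
    while (@tokens and $tokens[0][0] eq 'OPERATOR' and $tokens[0][1] eq '*') {
        shift @tokens;
        $node = { op => 'multiply', kids => [ $node, parse_integer() ] };
    }
    return $node;
}

sub parse_integer {
    my $token = shift @tokens;
    return { op => 'const', value => $token->[1] };
}

For print 2 + 3 * 4;, calling parse_statement(lex_statement('print 2 + 3 * 4;')) hangs the multiplication in its own subtree beneath the addition, which is exactly the shape we're about to see in Figure 3.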

A parser also often knows something about arithmetic precedence. It knows that multiplication should be handled before addition, so as it reads the tokens, it applies the rules in the order shown in the table under "From Code to Tree" above.

This seems kind of stupid. We just applied a bunch of rules and got back what we put in. What's the point?

Rule application for print 2 * 3 + 4 (discussed below):

    Rule               Result
    PRINT-STATEMENT    PRINT EXPRESSION SEMICOLON
    EXPRESSION(2)      PRINT EXPRESSION OPERATOR INTEGER
    EXPRESSION(1)      PRINT INTEGER OPERATOR EXPRESSION OPERATOR INTEGER
    EXPRESSION(3)      PRINT INTEGER OPERATOR INTEGER OPERATOR INTEGER
    Simplified         PRINT 2 * 3 + 4

The point is the choice of rules and the order in which they were applied: applying them this way preserves the operator precedence. Take a look at Figure 3, which depicts the parse tree for print 2 + 3 * 4;.

Figure 3: the parse tree for print 2 + 3 * 4;

Because 3 * 4 is stuck in its own little branch off to the lower right, we know that it will be evaluated before the addition is performed. If the expression had been 2 * 3 + 4, the rules would have been applied as shown in the second table above.

Figure 4 shows the parse tree for print 2 * 3 + 4;.

Figure 4: the parse tree for print 2 * 3 + 4;

Semantic Analysis

After the parser has had its way with the input, it passes the tree on to other parts of the front end for semantic analysis. Although a parser can handle simple things like operator precedence, the semantic analyzer is needed to interpret more complex aspects of a language, such as whether a value should be interpreted in a scalar or list context.

In this article, we're not going to see any examples of semantic analysis, but it's an important enough part of the front end to merit mentioning.

Figure 5 shows a now leaner version of the tree that preserves all the information we need.

Figure 5: a leaner version of the tree

Walking the Tree

The compiler back end "walks" the tree produced by the front end; its path is shown by the arrows in Figure 5. How does the syntax tree preserve information about the precedence of operators? The tree isolates certain operations (such as multiplication) when other operations (such as addition) are dependent upon their completion. In this case, 2 + EXPRESSION is dependent on the completion of 3 * 4.
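
For the toy tree built by the parser sketch earlier, that walk is a single recursive subroutine that emits instructions for the four-instruction virtual machine from the beginning of the article. The emit() name and node layout are the same invented ones as before; this is a sketch, not production code:

# Post-order walk: emit code for the children first, then for the node
# itself, so each operator finds its operands already on the stack.
my %instruction = (add => 'add', multiply => 'mul', print => 'print');

sub emit {
    my ($node) = @_;
    if ($node->{op} eq 'const') {
        print "push $node->{value}\n";
        return;
    }
    emit($_) for @{ $node->{kids} };           # children first...
    print "$instruction{ $node->{op} }\n";     # ...then the operator itself
}

# For 'print 2 + 3 * 4;' this prints: push 2, push 3, push 4, mul, add, print.
emit( parse_statement( lex_statement('print 2 + 3 * 4;') ) );

That isn't byte-for-byte the hand-written answer from earlier (it pushes the 2 before evaluating 3 * 4 rather than after), but it leaves the same operands on the stack at each operator, so the toy machine still prints 14.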

The Story of O

O is a module that is part of the Perl compiler. It's essentially a controller for various back ends; to use any compiler back end, you need to use O. Here's an example of how you can use O to activate the B::Terse back end, which walks the Perl syntax tree and prints out some brief information about what's going on. Note that in order to avoid 2 + 3 * 4 being folded into a single constant, I've used variables to refer to the numbers.

bash-2.01$ perl -MO=Terse -e 'print $two + $three * $four;'
   LISTOP (0xb4a18) leave
       OP (0xb4a40) enter
       COP (0xb49e0) nextstate
       LISTOP (0xb4998) print
           OP (0xb49c0) pushmark
           BINOP (0x27be0) add [2]
               UNOP (0xb4758) null [15]
                   GVOP (0x37880) gvsv  GV (0x2903c) *two
               BINOP (0x27ba0) multiply [1]
                   UNOP (0xb4868) null [15]
                       GVOP (0xb4778) gvsv  GV (0x28fdc) *three
                   UNOP (0xb4978) null [15]
                       GVOP (0xb4888) gvsv  GV (0x28fc4) *four

This is a slightly different way of looking at the parse tree. But as the last five lines indicate, the multiplication is isolated as we wanted.

From Perl to Java

B::JVM::Jasmin is a back end to the Perl compiler; it emits a Jasmin source file. You can feed this file into Jasmin and get a Java class file as a result. B::JVM::Jasmin is essentially a proof-of-concept work. It shows that it's possible to generate Java bytecode from Perl, but it doesn't go far enough, because it doesn't handle advanced features of Perl.

The O module talks to a back end by calling the back end's compile() method. The compile() method in B::JVM::Jasmin opens up the output file, and calls the mywalk() method, which walks the Perl parse tree.
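
The contract with O is small: a back end is just a module in the B:: namespace whose compile() function returns a code reference, which O arranges to run once the program has been compiled instead of executing it. A minimal, do-nothing back end might look roughly like this (the module name is hypothetical, and the details of the O protocol have shifted a little between Perl versions):

package B::Nothing;

use strict;
use B qw(main_root class);

sub compile {
    my @options = @_;             # whatever followed the back end's name on -MO=
    return sub {
        my $root = main_root();   # the root of the op tree for the main program
        print "Root of the op tree: ", class($root), " (", $root->name, ")\n";
    };
}

1;

Dropped into @INC as B/Nothing.pm, it could be run as perl -MO=Nothing -e 'print $two + $three * $four;' and should report a LISTOP named leave, matching the first line of the B::Terse output above.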

As B::JVM::Jasmin walks the parse tree, it encounters various opcodes, which roughly correspond to what we saw in the parse tree. Every opcode can have up to two corresponding methods that handle it: the pre_action, which is executed before mywalk() handles any of that opcode's children, and the post_action, which is executed after the children are handled.

Here is a summary of what mywalk() does for every node it encounters on the parse tree:

Execute the pre_action, and store its output.
Execute the post_action, and store its output.
Emit the output of the pre_action.
If the current node has children:
   For each child:
      Call mywalk() on this child.
Emit the output of the post_action.
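
In toy form, using the hash-based trees from the earlier sketches, that recipe looks something like the code below. This is illustrative code, not the module's: the %pre_action and %post_action tables are invented, and the real handling of print also has to cope with converting the value to a string, as we saw in the hand-written Jasmin example.

# Hypothetical pre- and post-action tables, keyed by op name.  Only print
# needs a pre-action here: it pushes System.out before its children leave
# a value on the stack.
my %pre_action = (
    print    => sub { "getstatic java/lang/System/out Ljava/io/PrintStream;" },
);
my %post_action = (
    const    => sub { "sipush $_[0]{value}" },
    add      => sub { "iadd" },
    multiply => sub { "imul" },
    print    => sub { "invokestatic java/lang/String/valueOf(I)Ljava/lang/String;\n"
                    . "invokevirtual java/io/PrintStream/println(Ljava/lang/String;)V" },
);

sub mywalk_toy {
    my ($node) = @_;
    # Run both actions up front and store their output, as the summary says.
    my $pre  = $pre_action{  $node->{op} } ? $pre_action{  $node->{op} }->($node) : undef;
    my $post = $post_action{ $node->{op} } ? $post_action{ $node->{op} }->($node) : undef;
    print "$pre\n"  if defined $pre;                 # emit the pre-action output
    mywalk_toy($_) for @{ $node->{kids} || [] };     # then walk each child
    print "$post\n" if defined $post;                # then emit the post-action output
}

Fed the tree for print 2 + 3 * 4;, this prints the getstatic first, then sipush instructions for the constants, imul before iadd, and the println() call last, which is essentially the body of the hand-written Jasmin example.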

B::JVM::Jasmin only pays attention to a subset of the full tree that the compiler provides, shown below as a subset of the B::Terse output we saw earlier:

  LISTOP (0xb4998) print
      BINOP (0x27be0) add [2]
          GVOP (0x37880) gvsv  GV (0x2903c) *two
          BINOP (0x27ba0) multiply [1]
              GVOP (0xb4778) gvsv  GV (0x28fdc) *three
              GVOP (0xb4888) gvsv  GV (0x28fc4) *four

Before we can get any useful output from B::JVM::Jasmin, let's expand our example (print $two + $three * $four;) to assign some values to the three variables. (Remember, we are using variables instead of constants only to keep Perl from folding them into one number at compile time).


 $two = 2;
 $three = 3;
 $four = 4;
 print $two + $three * $four;

Figure 6 shows what B::JVM::Jasmin does with the first three lines; the resulting source code for Jasmin follows. Note that B::JVM::Jasmin automates storage for local variables by assigning the first variable it encounters to JVM local #1, the second to JVM local #2, and so forth:

Figure 6: what B::JVM::Jasmin does with the first three lines


   sipush 2  ; push the value 2
   istore_1  ; store it in local variable 1 (corresponds to $two)
   sipush 3  ; push the value 3
   istore_2  ; store it in local variable 2 ($three)
   sipush 4  ; push the value 4
   istore_3  ; store it in local variable 3 ($four)
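
The bookkeeping behind that is tiny; something along these lines would do it (a hypothetical helper written for this article, not the module's actual code):

# Hand out JVM local-variable slots in the order variables are first seen.
# Slot 0 is left alone, since it conventionally holds main()'s String[] argument.
my %slot;
my $next_slot = 1;

sub slot_for {
    my ($varname) = @_;
    $slot{$varname} = $next_slot++ unless exists $slot{$varname};
    return $slot{$varname};
}

# slot_for('$two') is 1, slot_for('$three') is 2, slot_for('$four') is 3,
# and asking about '$two' again still returns 1.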

Figure 7 shows how the pre_action and post_action are executed for the remainder of the program. Notice how the print opcode sets up the PrintStream object on the way in, and then expects a value on the stack on the way out as it invokes print(). Here is the output:

Figure 7: how the pre_action and post_action are executed for the remainder of the program


    1  getstatic java/lang/System/out Ljava/io/PrintStream;
    2  iload_1   
    3  iload_2   
    4  iload_3 
    5  imul   
    6  iadd      
    7  invokestatic java/lang/String/valueOf(I)Ljava/lang/String;
    8  invokevirtual java/io/PrintStream/print(Ljava/lang/String;)V

Line 1 pushes the object java.lang.System.out on the stack (Java's STDOUT).

Lines 2, 3, and 4 fetch the contents of $two, $three, and $four.

Line 5 multiplies the top two items ($four and $three).

Line 6 adds the top two items ($two and 12).

Lines 7 and 8 convert the integer on the stack to a string, and invoke the print() method of the object we pushed in line 1.

It's Alive!

When we pump B::JVM::Jasmin's output through Jasmin and then the JVM, we see the correct output: 14.

Summary

Compilation is tricky business. Thankfully, someone's done some of the hard work for us. Malcolm Beattie's Perl Compiler handles the front end, and Jon Meyer's Jasmin takes care of part of the back end's work. There still is much to be done, though. B::JVM::Jasmin is just a small part, and doesn't take into account things like subroutines, modules, classes, regular expressions, and many other things. To a certain extent, some of these ideas will port over to Java well. Others, like regular expressions and multiple inheritance, will be tricky.

What does the future hold? All eyes on the JPL mailing list are upon Bradley M. Kuhn, who is working on implementing something a little more advanced than B::JVM::Jasmin as part of his graduate research. B::JVM::Jasmin was useful as a prototype, but future work in this area will probably be built using B::JVM::Jasmin as a reference, rather than a foundation. That's one of the best things a prototype can hope for!

__END__


Brian Jepson is the Minister of Information for the SMT Computing Society, a ragtag band of enthusiasts from Rhode Island. When he's not working on a book or some random piece of software, he's probably doing something predictable, like throwing televisions from the top of regional landmarks, or encouraging others to do such things for his entertainment.

