
Best of Both Worlds: Embedding Perl in C

Doug MacEachern

The idea of using one computer language from another is, if you think about it, somewhat frightening. After all, getting one language to do what you want is hard enough; getting two languages to do what you want and play nice with each other sounds exponentially more troublesome.

But it's not that painful. And even if it were, it would sometimes be worth the suffering to, say, use Perl's pattern matching capabilities from C.

Perl is itself a C program, but this is something you normally don't need to know. Let's think about the basics for a moment. Perl's main() contains a mere 20 or so lines of code in its body. The minimal functionality required to run a Perl script inside a C program consists of just six lines of code: function calls that allocate a Perl interpreter, parse the script, run the script and deallocate the interpreter. Simple, eh?

Let's walk through these six lines and see what's going on.

/* allocate the interpreter object */ 

PerlInterpreter *my_perl = perl_alloc(); 

/* 
 * initialize the interpreter, some internal 
 * global variables, and the stack 
 */ 

perl_construct(my_perl);

/* 
 * initialize some global variables (@ARGV, %INC, etc.)
 * compile the code into perl's internal format 
 * or "parse tree" 
 * 
 * xs_init is a function pointer which is normally
 * used to create the Perl entry point to C libraries,
 * e.g. DynaLoader and static extensions' "bootstrap"
 * methods. argc is the number of elements in the argv
 * array; argv[] contains what is normally the command
 * line arguments passed to the interpreter, e.g.
 *   "/usr/bin/perl", "-e", "print qq(Hello world!)"
 * The final argument may be a char ** array of
 * environment variables. NULL means the current env.
 */

perl_parse(my_perl, xs_init, argc, argv, (char **)NULL);

/* 
 * interpret the script (walk the parse tree) 
 */

perl_run(my_perl);

/* 
 * "destroy" objects and free memory allocated for 
 * the script's symbols, etc. 
 */

perl_destruct(my_perl);

/* 
 * free memory allocated for the PerlInterpreter object itself 
 */

perl_free(my_perl);

That wasn't so bad, was it? But you might still be wondering why you'd want to do this.

Naturally, if your applications are written entirely in Perl, there's no need to embed your own Perl interpreter. You could, in theory, but it wouldn't make much sense. Perl already lets you execute separate Perl code from your programs: see eval() and use and require. But if you want to use Perl's features from C or C++, you can embed a Perl interpreter in your program and have it do as much or as little of the work as you'd like. Of course, you can always have your C program talk to a Perl script through pipes. That works, but it's slow: it requires a separate process.

Depending on the size of your Perl script, you may also be hurt by the time it takes to parse the script. Embedding a Perl interpreter in your program avoids this overhead.
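For comparison, here is a rough sketch of the pipe approach mentioned above, written with POSIX popen(). The helper name first_line_from_command is hypothetical and the error handling is deliberately minimal; the point is only that every call starts a new process (and, for a Perl script, a fresh parse).

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() and pclose() */
#include <stdio.h>
#include <string.h>

/*
 * Run `command` in a separate process and copy the first line of its
 * output into `line` (at most `size` bytes, trailing newline stripped).
 * Returns 0 on success, -1 on failure.
 * Hypothetical helper for illustration; not part of Perl's API.
 */
int first_line_from_command(const char *command, char *line, size_t size)
{
  FILE *pipe_in;
  int ok;

  if (size == 0) return -1;

  pipe_in = popen(command, "r");   /* starts a shell, which starts the program */
  if (pipe_in == NULL) return -1;

  ok = (fgets(line, (int)size, pipe_in) != NULL);
  if (ok) line[strcspn(line, "\n")] = '\0';

  /* pclose() waits for the child; a nonzero status counts as failure here */
  if (pclose(pipe_in) != 0) ok = 0;

  return ok ? 0 : -1;
}
```

Assuming perl is on your PATH, calling first_line_from_command("perl -e 'print scalar reverse q(Perl)'", buf, sizeof buf) should leave "lreP" in buf, at the price of one process start per call.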

Think of C as the concrete foundation of your house. It's sturdy. Stable. Boring. Every house has a foundation, and all the foundations look pretty much the same. Your house is Perl; it's quirky, artistic, and a good deal more flexible than your foundation. But without the foundation, your house wouldn't be there.

Sometimes, when you want to make a radical change to your house, you'll need to drill into the foundation and use C from Perl. Your drill is called XS, which will be the topic of a future article in TPJ. Perl is the sweet outside, C is the crunchy cementy inside. That's what I call using Perl "inside out."

When your foundation is sagging, and you need to reinforce it with steel risers extending into your house frame, that's using Perl "outside in." The tool is embedding, and the result is a more innovative and flexible foundation.

Embedding for Regex Processing

I don't have to tell you how rich and powerful Perl's regular expression engine is. Let's squeeze together some examples from Jon Orwant's perlembed documentation to show you what I mean. The example below embeds a Perl interpreter along with a script containing a subroutine for performing substitutions and matches à la m// and s///.

We create an interpreter, parsing a script called 'regex.pl'. Inside the script is a subroutine named 'regex' which performs the substitution or match. We invoke this subroutine with the Perl API function perl_call_pv(), passing it two arguments: the string we want to modify and the operation to perform. Here's regex.pl.

use strict;
use vars qw(@Matches);

sub regex {
    my($string, $operation) = @_;
    my $n;

    # hold matches in an array accessible to our C program

    @Matches = ();

    # use eval to handle m//, s/// and tr///

    # if we're using m//

    if($operation =~ m:^m:) {
        eval "\@Matches = (\$string =~ $operation)";
        $n = scalar @Matches;
    }
    else {
        eval "\$n = (\$string =~ $operation)";
    }
    return($n, $string);
}

The program regex.c is shown further below.

We then fetch the return values of the subroutine call, which are the number of matches and the possibly modified string. In addition, if we attempt a pattern match with m//g, the matches are stored in the array @Matches. We can examine this array first by getting a pointer to it with the perl_get_av() function and then accessing each element with the av_shift() function. Each element of the array is an SV* (scalar value), which we convert to a string using the SvPV macro.

To compile the program, use your C compiler and the ExtUtils::Embed module:

% cc -o regex regex.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

Now run the program:

% regex "You see, Perl and C are family" 
Perl family 

regex: s/[aeiou]//gi...10 substitutions made. 
Now text is: Y s, Prl nd C r fmly

Sorry, can't replace Perl with C.

This is just a sample of how your C program could take advantage of Perl's regex support. Play around and see what else you can do!

In case you are wondering about ExtUtils::Embed, it contains some of the information your C compiler needs to embed Perl: locations of header files and libraries, and compilation flags. In order for your C program to access Perl's API, it needs to include Perl's header files and link with the Perl runtime library. And there will inevitably be other flags required by your C compiler and linker if your embedded interpreter is to do everything your Perl executable can. This information will differ depending on your system and version of Perl; to see what your setup needs, try this:

perl -MExtUtils::Embed -e ccopts -e ldopts

and you'll see what I mean.

regex.c

/* regex.c -- derived from the perlembed documentation */ 
#include <stdio.h> 
#include <EXTERN.h> 
#include <perl.h> 
static PerlInterpreter *my_perl;

/** regex(string, operation) 
 ** 
 ** Used for =~ operations 
 ** 
 ** Returns the number of successful matches, and 
 ** modifies the input string if there were any. 
 **/ 
static int regex(char *string[], char *operation) { 
  int n;

  dSP;                                      /* initialize stack pointer */ 
  ENTER;                                    /* everything created after here */   
  SAVETMPS;                                 /* ...is a temporary variable. */   
  PUSHMARK(sp);                             /* remember the stack pointer */   
  XPUSHs(sv_2mortal(newSVpv(*string,0)));   /* push the string onto the stack */ 
  XPUSHs(sv_2mortal(newSVpv(operation,0))); /* push the operation onto stack */ 
  PUTBACK;                                  /* make local stack pointer global */   
  perl_call_pv("regex", G_ARRAY);           /* call the function */   
  SPAGAIN;                                  /* refresh stack pointer */

  *string = savepv(POPp);                   /* copy the perhaps modified string; FREETMPS frees the original */ 
  n = POPi;                                 /* fetch the number of substitutions made */

  PUTBACK; 
  FREETMPS;                                 /* free that return value */ 
  LEAVE;                                    /* ...and the XPUSHed "mortal" args.*/

  return n;                                 /* the number of substitutions made */ 
}

int main (int argc, char **argv, char **env) 
{ 
  char *embedding[] = { "", "regex.pl" }; 
  char *text; 
  int num_matches;

  if (argc < 2) { fprintf(stderr, "usage: regex <string>\n"); exit(1); }

  text = argv[1];
  my_perl = perl_alloc(); 
  perl_construct( my_perl ); 
  perl_parse(my_perl, NULL, 2, embedding, NULL);

  if(num_matches = regex(&text, "m/([a-z]{4,6})/gi")) { 
    AV *array; 
    SV *match; 
    STRLEN len; 
    int i;

    /* get a pointer to the @Matches array */ 

    array = perl_get_av("Matches", FALSE);

    /* take a look at each element of the @Matches array */

    for(i=0; i < num_matches; i++) { 
      /* just like '$match = shift @Matches;' */
      match = av_shift(array);     
      printf("%s ", SvPV(match, len)); 
    } 
    printf("\n\n"); 
  } 

  /** Remove all vowels from text **/ 

  num_matches = regex(&text, "s/[aeiou]//gi"); 
  if (num_matches) { 
    printf("regex: s/[aeiou]//gi...%d substitutions made.\n",
            num_matches); 
    printf("Now text is: %s\n\n", text); 
  }

  /** Can we replace Perl with C?? **/ 

  if (!regex(&text, "s/Perl/C/")) { 
    printf("Sorry, can't replace Perl with C.\n\n"); 
  } 

  perl_destruct(my_perl); 
  perl_free(my_perl); 
}
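
The regex() wrapper above pops the string before the count, even though regex.pl returns ($n, $string). The reason is that return values come back on a stack, so the last value pushed is the first one popped. The following is a toy model in plain C, not Perl's real argument stack; every name in it is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* A toy value stack: a plain C model, not Perl's real argument stack. */
#define TOY_STACK_MAX 8

struct toy_stack {
  const char *item[TOY_STACK_MAX];
  int top;                      /* number of items currently on the stack */
};

/* Push one value; silently ignored if the toy stack is full. */
void toy_push(struct toy_stack *s, const char *value)
{
  if (s->top < TOY_STACK_MAX) s->item[s->top++] = value;
}

/* Pop the most recently pushed value, or NULL if the stack is empty. */
const char *toy_pop(struct toy_stack *s)
{
  return s->top > 0 ? s->item[--s->top] : NULL;
}

/* Models 'return($n, $string)': the count is pushed first, the string last. */
void toy_return_count_and_string(struct toy_stack *s)
{
  toy_push(s, "10");                    /* $n */
  toy_push(s, "Y s, Prl nd C r fmly");  /* $string */
}
```

After toy_return_count_and_string(), the first toy_pop() yields the string and the second yields the count, which mirrors the POPp-then-POPi order in regex.c.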

Embedding for URL Parsing

When you install Perl, all of your configuration information is saved in the Config module, where ExtUtils::Embed can find what it needs. In addition, it can take care of the work needed if your embedded scripts wish to use modules, which themselves use C libraries.

Regular expressions are wonderful, but there are plenty of other reasons you might want to embed a Perl interpreter - the CPAN alone has about 150. As a simple example, let's use the URI::URL module from the LWP distribution. URI::URL uses regular expressions underneath to parse and synthesize URLs, but this is transparent to the user. What we are showing here is simply how to embed a module, construct an object, and use it to invoke methods.

As always, we start by constructing a Perl interpreter object. But this time we're not parsing a script. Instead, we pass the -M switch with an argument of URI::URL, which tells Perl to use URI::URL. We then generate some code to construct a new URI::URL object, using the function perl_eval_sv() to compile the code. URI::URL objects may need to inherit methods, so we invoke methods in our C program with the perl_call_method() function, which will look up the subroutine via the object's @ISA tree.

Perl itself uses perl_call_method() to implement the tie() mechanism (see the perltie documentation).
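
Generating Perl code from C means formatting into a fixed-size buffer, so it is worth checking for truncation before handing the string to the interpreter. Here is a minimal sketch, assuming each argument is wrapped in Perl's q() quoting and contains no unbalanced parentheses; build_url_code is a hypothetical helper, not part of Perl or LWP.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write Perl code that constructs a URI::URL object into `code`.
 * `base` may be NULL when the URL is already absolute.
 * Returns 0 on success, -1 if the code would not fit in `size` bytes.
 * Hypothetical helper for illustration only.
 */
int build_url_code(char *code, size_t size, const char *url, const char *base)
{
  int written;

  if (base != NULL)
    written = snprintf(code, size,
                       "$url = URI::URL->new(q(%s), q(%s));", url, base);
  else
    written = snprintf(code, size,
                       "$url = URI::URL->new(q(%s));", url);

  /* snprintf reports the length it wanted; anything >= size was truncated */
  return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

A caller would pass the resulting string to perl_eval_sv() only when the function returns 0, rather than evaluating a half-written statement.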

 
#include <stdio.h> 
#include <EXTERN.h> 
#include <perl.h>

static PerlInterpreter *my_perl;

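SV *new_uri_url(char *url, char *base) 
{ 
  char code[512];

  /* code to construct a new URI::URL object; the base is optional. 
   * q() keeps each argument a separate, uninterpolated string 
   * (this assumes neither contains an unbalanced parenthesis). */
  if (base)
    snprintf(code, sizeof(code),
             "$url = URI::URL->new(q(%s), q(%s));", url, base);
  else
    snprintf(code, sizeof(code), "$url = URI::URL->new(q(%s));", url);

  /* eval the code */ 
  perl_eval_sv(newSVpv(code,0), G_DISCARD | G_SCALAR);

  /* fetch the $url object */ 
  return perl_get_sv("url", FALSE); 
}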

char *url_method(SV *obj, char *method, char *arg) 
{ 
  char *ret; 
  int count;

  dSP;                                      /* initialize stack pointer */ 
  ENTER;                                    /* everything created after here */ 
  SAVETMPS;                                 /* ...is a temporary variable. */ 
  PUSHMARK(sp);                             /* remember the stack pointer */
  XPUSHs(sv_2mortal(newSVsv(obj)));         /* push the url object onto the stack */ 
  if (arg) XPUSHs(sv_2mortal(newSVpv(arg,0))); /* push the optional arg, if any */ 
  PUTBACK;                                  /* make local stack pointer global */ 
  count = perl_call_method(method, G_SCALAR); /* call the method */ 
  SPAGAIN;                                  /* refresh stack pointer */ 

  ret = savepv(POPp); /* copy the return value before FREETMPS frees it */
  
  PUTBACK; 
  FREETMPS; /* free that return value */ 
  LEAVE; /* ...and the XPUSHed "mortal" args.*/ 
  return ret; 
}

int main (int argc, char **argv, char **env) 
{ 
  char *embedding[] = {"", "-MURI::URL", "-e", "0"};
  SV *url; 
  char *ret;
  if (argc < 3) { 
    fprintf(stderr, "usage: uri-url <url> <method> [<base>]\n");
    exit(1);
  }

  my_perl = perl_alloc(); 
  perl_construct( my_perl );
  perl_parse(my_perl, NULL, 4, embedding, NULL); 

  /* argv[1] is the fully qualified or relative url */   
  /* argv[2] is the URI::URL method to call */ 
  /* argv[3] is the url base if needed to resolve relative url's */
  url = new_uri_url(argv[1], (argc==4) ? argv[3] : NULL);
  ret = url_method(url, argv[2], NULL);
  printf ("%s = '%s'\n", argv[2], ret);

  perl_destruct(my_perl); 
  perl_free(my_perl); 
}

Compile with cc -o uri-url uri-url.c `perl -MExtUtils::Embed -e ccopts -e ldopts`

...and give it a try:

% uri-url http://www.perl.com/CPAN/ host 

host = 'www.perl.com'

We've seen some examples showing how and why C applications might want to embed a Perl interpreter. Now, let's take a look at some real-life applications that actually do.

Scripting languages were added to the nvi text editor to allow for user customizable extensions without having to recompile nvi. Instead of building a new language from scratch, the developers chose to support existing languages. One such language is Perl, so you can run Perl commands directly from within nvi.

In a non-interactive program, there is no way to avoid the possible penalty of script parse time. However, in an interactive application such as trn or nvi, Perl code can be parsed when the program starts.

Embedding for HTTP

Many interactions with HTTP servers are simple requests for static HTML documents. More complex and exciting requests usually require an application (e.g. Perl) external to the server. The standard way for a server to communicate with another program is via CGI (the Common Gateway Interface). Designed for simplicity, CGI is plagued by statelessness, the overhead of starting a new process for each request, and the start-up time of large Perl programs.
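
To make the per-request cost concrete, here is a rough sketch of what a CGI-style server has to do for every request: create a process, pass request data through the environment, start the program, and read its output from a pipe. This is an illustration using POSIX calls, not code from any real server; run_cgi_style is a hypothetical name, and only REMOTE_HOST is set where a real CGI environment carries many more variables.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run one CGI-style program for one request: a new process every time.
 * The child's stdout is captured into `out` (NUL-terminated, at most
 * size-1 bytes). Returns the child's exit status, or -1 on error.
 */
int run_cgi_style(char *const argv[], const char *remote_host,
                  char *out, size_t size)
{
  int fds[2], status;
  size_t used = 0;
  ssize_t n;
  pid_t pid;

  if (size == 0 || pipe(fds) == -1) return -1;

  pid = fork();                      /* cost 1: a new process per request */
  if (pid == -1) { close(fds[0]); close(fds[1]); return -1; }

  if (pid == 0) {                    /* child */
    close(fds[0]);
    dup2(fds[1], STDOUT_FILENO);     /* the program's stdout feeds the pipe */
    close(fds[1]);
    setenv("REMOTE_HOST", remote_host, 1);  /* request data travels via the environment */
    execvp(argv[0], argv);           /* cost 2: start (and parse) the program */
    _exit(127);                      /* only reached if the exec failed */
  }

  close(fds[1]);                     /* parent: read until the child closes its end */
  while (used < size - 1 &&
         (n = read(fds[0], out + used, size - 1 - used)) > 0)
    used += (size_t)n;
  out[used] = '\0';
  close(fds[0]);

  if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

An embedded interpreter skips both marked costs: the process already exists, and the Perl code was parsed when the server started.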

Early Web servers such as CERN and NCSA have no way to avoid these problems - it's CGI or bust, because they're not extensible. The first extensible server I came across was Tony Sanders' Plexus, written entirely in Perl. It provides a CGI interface, but also lets you hook in your own custom Perl gateways. These hooks make it possible to avoid the additional fork/exec required to run a CGI program, along with parsing any Perl code you choose when the server starts. However, as HTTP servers have evolved, Plexus has been at a standstill, so sadly we must move on.

When Netscape introduced the first commercial HTTP server, it offered a mechanism to avoid CGI overhead via a server API (NSAPI). It sounded exciting, but it was big money at the time, plus the perlembed documentation back then was simply "Look at perlmain.c and do something like that."

When I started working at OSF, I was introduced to a piece of their technology called the "strand" (Stream Transducer Daemon), which acts as a proxy server for filtering an HTTP stream. That is, a strand program sits in between the client (your browser) and the "real" server to which you make the request. The server itself is written in C, which handles the processes, network connections, protocols, etc. The stream is then handed to another process, hooking up the client socket to the program's STDIN and the server socket to STDOUT. Naturally, for parsing and manipulating the HTTP requests and responses, Perl quickly became the language of choice. As most of these applications were rather complex, each transaction was plagued with the same burdens as seen with CGI. Turns out, it was simple to embed a Perl interpreter in the server, avoiding the fork/exec and parse time. What a difference in speed!

So it was simple enough to embed an interpreter in the strand servers - why not try it in an HTTP server? After looking at the Apache API, it was clear this was possible. Apache's API provides a handful of hooks, allowing a module written in C to be linked with the server and then step in at any stage of the HTTP request, including authentication, URL rewriting, and handling the response itself. (The Apache project provides a public domain HTTP server that is secure, efficient, extensible, and in sync with the current HTTP standards. Apache project information, including API documentation, is available at http://www.apache.org/docs/API.html.)

Sure enough, Gisle Aas introduced mod_perl, a proof-of-concept Apache module that embeds a Perl interpreter so the server could run Perl scripts natively (read: quickly). The original module handled any request of type 'httpd/perl', calling the six basic functions described earlier: perl_alloc(), perl_construct(), perl_parse(), perl_run(), perl_destruct() and finally perl_free(). The module also provides a Perl interface to the server's C API, rather than painfully setting up a CGI environment. There was a noticeable performance improvement, but it was possible to do even more. Gisle was too busy with other projects (such as the great LWP package, described on page 21), so I offered to pick up where he left off.

The new approach implemented in the mod_perl module steps in when the server starts, at which time it allocates one Perl interpreter by calling perl_alloc() and perl_construct(). A configurable 'PerlScript' file is loaded by invoking perl_parse(). 'PerlScript' need not exist, as you can load modules into the parse tree with the 'PerlModule' directive. When the server starts, it also invokes perl_run(), to take care of any other initialization your PerlScript or PerlModules may need. If the server is restarted, perl_destruct() and perl_free() are called to clean up before going through the steps described above.

OK, we've already gone through each of the six required steps described earlier, but we haven't yet handled our first HTTP request. On a per-directory basis, a 'PerlHandler' configuration directive specifies the subroutine to call in memory for handling each request. This is done with the Perl API function perl_call_pv(). As you've seen, this is one of a few ways for a C program to make a "callback" into the embedded Perl code. We also take advantage of the G_EVAL option, passing this flag to perl_call_pv(). That traps any errors that occur during the callback. We can then send the proper HTTP error code to the client and use standard server mechanisms to log the error message, which will include the value of the $@ variable obtained with gv_fetchpv().

Since we're avoiding the CGI overhead, you'll encounter some differences between CGI scripts and mod_perl. The Apache API simply provides the hooks for extending/interfacing to the server, which by itself has nothing to do with the CGI implementation. In fact, the Apache CGI implementation is a module itself (mod_cgi), which sets up the standard environment and executes the program.

In order to integrate embedded scripts into a server, the scripts need to go back inside through the server's C API, just like any Apache module. Scripts and modules running under mod_perl are, in a sense, sitting right next to the API functions; they just need a way to hook into them directly. Perl provides an interface to C language functions via a language called XS, which acts as glue between Perl functions and their C counterparts, converting Perl's scalar data to the appropriate C data types and back again. The xsubpp program generates C code from your .xs file, which can then be built by any C compiler. This language makes it possible to provide a Perl interface to the Apache API and its request data structure. But first we'll need two changes to the server configuration. First,

PerlModule MyPackage

tells the embedded interpreter to require MyPackage when the server starts, using perl_eval_sv(). We'll also need

PerlHandler MyPackage::handler

which tells the embedded interpreter which subroutine to call during the request. It's configurable on a per-directory basis. And here's the Perl code:

package MyPackage; 
use strict; 

# "bootstrap" Perl interface to the server's C API,
# creating Perl methods that convert Perl data types
# to whatever the Apache API needs, and vice versa.
use Apache (); 

# Give Perl access to the httpd.h constants
# return codes, OK, etc.
use Apache::Constants ':common'; 

# This subroutine is called by the embedded 
# interpreter with perl_call_pv(), pushing one
# argument onto the stack: a pointer to the
# request_rec data structure, needed by most Apache
# API function calls. 
sub handler { 
    my($req) = @_; #pointer to request structure
    # Method to call Apache's get_remote_host() 
    my $host = $req->get_remote_host; 

    # Set the Content-Type header for the client 
    $req->content_type("text/html");

    #send the HTTP response headers to the client
    $req->send_http_header; 

    #send some data to the client 
    $req->write_client("Hey you from $host! <br>\n");
    return OK; 
}
1;

As you can see, this is somewhat different from a CGI script, since we're at a lower level in the server. What you might not realize is that the Apache CGI module (mod_cgi) communicates with the server and browser the same way. It's possible to provide a higher level interface so things look more like the CGI environment - for example, by subclassing CGI.pm. In addition, we've specified Apache::Registry as the PerlModule and Apache::Registry::handler as the PerlHandler in our server configuration files. This module wraps your code into a subroutine that is compiled only if it has changed on disk, and then calls the subroutine for you during each request.

use CGI::Switch ();

my $query = new CGI::Switch; 
my $host = $query->remote_host; 	 
$query->print($query->header,"Hey you from $host!<br>\n");

This script runs identically from the command line, under CGI, and embedded in Apache.

It still looks a little funny: what's this $query->print() all about? I/O between Apache and the browser is not stream oriented: it only appears that way to CGI scripts, since the CGI implementation uses pipes to send and receive data to and from the child process. As a result, a print() to STDOUT does not go to the client, nor does a read() from STDIN read from the client. Normally, we'd need a method to send data through the server API write functions.

However, with the new PerlIO abstraction and sfio, it is possible to redirect Perl's I/O streams through Apache's I/O abstraction and to the client in a clean way. With this, and two configuration directives turned on to send HTTP headers automatically and initialize %ENV, it's possible to write a script that looks just like a "normal" CGI script.

#!/usr/bin/perl
print <<"EOF"; 
Content-type: text/html

Hey you from $ENV{REMOTE_HOST}! <br>
EOF

At this point, you can write Apache modules in Perl without even knowing that your modules are embedded in the server! The intent of the Apache/Perl modules is not merely to provide a mechanism for faster CGI execution. Since we started from the bottom, providing a direct interface to the server C API, it is possible to write Apache modules in Perl to extend the server in many ways such as:

and whatever else you can dream up.

If you'd like to try using Perl outside in, a good place to start is the perlembed manpage and the ExtUtils::Embed package. Start with the perlxstut manpage and the CookBook[A|B] modules if you'd like to try using Perl inside out.

Remember the perlembed moral: "You can sometimes write faster code in C, but you can always write code faster in Perl. Since you can use each from the other, combine them as you wish."

__END__


Doug MacEachern is an engineer at The Open Group Research Institute.

