PREVIOUS  TABLE OF CONTENTS  NEXT 

Genetic Algorithms with Perl

Brad Murray and Ken Williams

Packages Used
Storable:           http://www.perl.com/CPAN

In most programming tasks, we humans carefully state the problem, code an algorithm to solve the problem, and turn the computer loose to run the algorithm. Then we hope the correct solution pops out, and if it doesn't we send an irate bug report to the people who developed our programming language.

Genetic algorithms (GAs) offer a different approach; we still carefully state the problem, but we let the computer find a suitable algorithm and apply that algorithm to the problem to produce a solution. Generating algorithms programmatically usually means working with code as data, which has traditionally left it more in the realm of LISP and related languages. But Perl can also generate and evaluate code on the fly, so it is also capable of handling generalized GAs.

The core principle behind a GA is that of evolution. You start with a set of random organisms, each of which is a program. You then run each of these programs and determine their fitness, which is the degree to which they succeed at the required task. Once the fitness of each is determined, you jump through some hoops to remove bad entries (natural selection), randomly permute some remaining entries (mutation), and intermingle other entries (hot algorithmic sex).

After repeating this process through several generations, the population will begin to converge on a solution. Often, it is not quite the solution you were expecting, and therein lies lot of the fun in building a GA engine.

The Genetic Code

The genetic code we will use to describe our organisms has many parallels to genetic code found in real plants and animals. It also has a huge number of divergences. Draw analogies at your own risk.

The GA task we'll set before ourselves is to find algebraic expressions that can generate a certain predetermined set of data points when applied to a given set of input data. We'll call these algebraic expressions our organisms, and we'll represent them as simple binary syntax trees composed of functions and terminals. Each function has two branches representing its arguments, and each argument can either be another function or a terminal. Terminals are dead-end leaf nodes and are usually constants or one of the input parameters in the problem. For instance, in an organism represented by the algorithm add(multiply(2,x),5), we have the functions add() and multiply(), and the terminals 2, x, and 5.

Each problem requires some twiddling by the Genetic Engineer. You need to determine a set of functions and terminals that will be part of your organisms' makeup. You might think of these as roughly equivalent to the various base pairs in DNA, or perhaps to various short sequences of base pairs. For the example we'll show here, we're trying to fit a function to the data:

-1407, -931, -577, -327, -163, -67, -21, -7, -7, -3, 23, 89, 213, 413, 707, 1113, 1649

...which just happens to be the output of 3x3 + 2x2 – x – 7 applied to each integer on the interval [–8,8]. To hit this (and we'll assume that we have no idea what the solution should look like) we'll use a function set that includes:

  sin(x,y)
  log(x,y)
  mul(x,y)
  add(x,y)

Now we know that the real solution only needs the last two, so the first two are there just to confuse the GA. We picked sin(x,y) because some regions on the graph look like they might be sinusoidal. We picked log(x,y) just to drive up the computing time and make things interesting.

We know that sin(x,y) and log(x,y) look odd, since sin() and log() only take one argument apiece. But by our definition, each function in our syntax tree has two branches. In the case of unary functions, we simply throw the second argument away, but it is valuable to keep it in the data structure, as a mutation may change a sin() to an add() and suddenly make the second argument interesting. As the second argument could be a whole tree of functions on its own, this could get very interesting indeed.

So given these functions, we build an array of functions as strings. Remember we said that we needed code as data? Here it is, to be eval'ed later:

  my @functions = (
  # Format is 'function <pretty version>: <actual code>'
  'function add($a,$b): ($a+$b)',
  'function mul($a,$b): ($a*$b)',
  'function sin($a,$b): sin($a)',
  'function log($a,$b): do{my $_a=$a; $_a ? log(abs($_a)) :  0}',
  );

Notice that our log() function is protected from zero or negative values. Any function that has sensitive arguments needs to be appropriately protected from them as there will be no chance to trap the errors outside of the evaluation. In this case, we need to protect it in a rather obscure way from Perl's constant-folding optimizer - try running the one-liner 0 ? log(abs(0)) : 0 to see the problem.

Next we need terminals. For this exercise we have two: x (the parameter that we will vary over the interval of interest, i.e. the input to these algebraic expressions) and a constant integer between -9 and 9. We specify these as subroutine references:

  my @terminals = (
    sub { '$x' }, # The input parameter
    sub { '$x' },
    sub { int(rand(20) - 10) },  # Some random number
    sub { int(rand(20) - 10) },
                  );

...which return either the name of the parameter ('x') or the random number. … Now, what does this code look like? Well, as we said, it's a syntax tree that looks something like so:

digraph stree {
  mul  [ label = "mul()" ];
  add  [ label = "add()" ];
  sin  [ label = "sin()" ];
  mul2 [ label = "mul()" ];
  x1   [ label = "x" ];
  x2   [ label = "x" ];
  x3   [ label = "x" ];
  c1   [ label = "-5" ];
  c2   [ label = "7" ];

  mul -> add;
  mul -> mul2;
  mul2 -> x1;
  mul2 -> x3;
  add -> sin;
  add -> c1;
  sin -> x2;
  sin -> c2; }

...which corresponds to mul(add(sin(x,7),–5),mul(x,x)) which in turn reduces to (sin(x)–5)x2, which is, of course, wrong. Let's hope evolution can help.

Assembling an Organism

We'll represent each organism as a Perl object. If making little genetic organisms isn't a good opportunity to use object-oriented programming, we don't know what is.

Now, it's one thing to know what a bridge looks like, but designing and building a bridge is something else altogether. We think it's safe to say the same is true of organisms. We'll build ours recursively, with some sanity checks.

First we check against a maximum depth value and plug in only terminals past this point. This keeps the tree from getting too crazy at the outset (it will get crazy enough later). If we are inside the maximum depth, we randomly select a function or terminal (with an arbitrary 2:1 weight towards terminals). If we selected a function, then we call the organism's _buildTree() method again to get two more nodes to use as input to the function. And so on.

  sub _buildTree {
    my $self = shift;
    my $depth = shift || 0;
  
    my %tree;
    $tree{contents} = ($depth>$maxdepth or int rand(3)) ?
                       &{$terminals[rand @terminals]} :
                       $functions[rand @functions];

    if ($tree{contents} =~ /^function /) {
      $tree{'a'} = $self->_buildTree($depth + 1);
      $tree{'b'} = $self->_buildTree($depth + 1);
    }

    return \%tree;
  } 

This builds for us a hash of nodes, each of which has three components: $tree{contents}, which contains either a terminal value (a constant or $x in this case) or a function; $tree{'a'} and $tree{'b'} are references to other nodes. If the content is a terminal, left and right pointers are not generated.

Survival of the Fittest

Just generating random organisms is not enough. We need to rank them according to their fitness, so we can decide which to cull out. We also need to determine fitness so that we know when to stop: unlike real evolution, we are trying to hit a fixed target. Evolution, however, is a feedback mechanism, and is therefore designed to hit a moving target. This means that once we reach a perfect fit for the data, the algorithm will keep trying new things even though the current fit is perfect. This will result in the program oscillating around the correct answer, which doesn't help. If you are trying to find an algorithm to hit a moving target, you still need to know the fitness at each generation, though you will probably have to do some statistical work on your results in order to find the mean success rate over the changing target.

We calculate the fitness by averaging the difference between each fixed data point and the corresponding result of the organisms' function (its phenotype). Thus fitness is a non-negative number, and a fitness of zero indicates a perfect organism. To calculate the output of the syntax tree, we have a function called fitness():

  sub fitness {
      # Determine the fitness of an organism in this crazy world
  
      my ($org, @target) = @_;
      my $sumdiff = 0;
  
      foreach (0..$#target) {
         $sumdiff += abs($org->evaluate({'x'=>$_}) - $target[$_]);
      }
      return $sumdiff/@target;
  } 

fitness() repeatedly calls the Organism's evaluate() method at points on the interval that interest us. Think of the evaluate() method as the Organism's whole reason for existence; if the Organism were a leech, the method would be called suck_blood(). If it were a gerbil, the method would be called freak_out(). The method simply applies the embedded behavior of the Organism to the given input data. In this case, we evaluate a single-variable algebraic expression for a given number.

The evaluate() method is just a simple front-end which calls Perl's eval on the result of the Organism's expression() method.

  sub evaluate {
      my $self = shift;
      my $params = shift;
  
      my $x = $params->{'x'};
      return eval $self->expression;
  }

  sub expression {  # Turn the syntax tree into a Perl expression
      my $self = shift;
      my $tree = shift || $self->{tree};
  
      # Check the cache
      return $self->{expr} if defined $self->{expr};
  
      local $_ = $tree->{contents}; # A shortcut for the current node
  
      if ( s/^function [^:]*: (.*)/$1/ ) { # Extract the Perl expression
           s/\$([a-zA-Z]+)/$self->expression($tree->{$1})/ge;
    }
  
    $self->{expr} = $_ if $tree eq $self->{tree}; # A nasty trick
    return $_;
  } 

Since expression works on a recursive data structure, it's natural that it's a recursive subroutine. If the current node represents a function, we scan the function description and pull out the Perl expression that implements that function. Then we replace any embedded variables ($a or $b) with the Perl expression their syntax tree represents. If the current node is a terminal, we leave it alone. Note that we have to go to a little bit of trouble to avoid touching the $_a variable in the logarithm function.

We happen to particularly like the way the expression() method combines its recursion and caching techniques. Building the Perl code from the syntax tree is a pretty intensive process, and we don't want to do it over and over again for the same Organism, so we cache our results. We only want to put the result in the cache when we're done with the work; that is, when we exist from the topmost call to this recursive subroutine. We detect whether that's the case by comparing the node we're currently working on to the topmost node in $self's tree. If they're the same, we cache and finish. The trick is to compare these two references as strings. When a reference is used in a string context, it contains a representation of the reference's memory address, so if the two references evaluate to the same string, they're the same reference. Sneaky. Most of the time this type of caching requires a wrapper function to handle the caching, and a recursive function to do the real work.

So now that we know how to evaluate the success or failure of each Organism, we need to do something about it.

Sex and Mutation

It won't be enough to just throw out the bottom half of the list and generate a new set of random organisms from scratch. This would be akin to trying to make a watch from a bucket of parts by shaking the bucket until a watch spontaneously assembles. Evolution doesn't work that way, and we don't have enough time to wait for pure randomness to cough up the algebra we want.

So what we do is rank the organisms by fitness and perform three operations on the list:

Culling is simple: we just unshift the same number of organisms that we are going to add by mating. In the case of our example, we have set the parameter $mates to 5, so we remove that many individuals. In point of fact what we actually do is mate and pop five times. Same thing.

Mutating is also pretty straightforward: we grab an organism at random and mutate it with its mutate() method:

sub mutate {
  my $self = shift;

  my @index = $self->_treeIndex;
  %{$index[rand @index]} = %{$self->_buildTree($maxdepth-1)};
  $self->clear_cache;
} 

This is a little deceptive. What it does is generate a list of the nodes in the syntax tree with _treeIndex(), and then pick a random one and substitute a new randomized branch. This mutation mechanism is pretty drastic, so we don't do it often. The likelihood of improving an organism is very small, though it is important to keep some random element of change happening within the population to keep it from settling at some local maximum that fits the function well, but not as well as we want.

The clear_cache() method simply clears the cached information we built in the expression() method, since it's no longer valid. The index generator looks like so:

sub _treeIndex {
  # Generates a list of all the nodes in this tree. These are
  # references to the stuff in the object itself, so changes to the
  # elements of this list will change the object.
  
  my $self = shift;
  my $tree = shift || $self->{tree};
  my @sofar = @_;
  
  # Dump the content nodes into a list
  
  if ($tree->{contents} =~ /^function /) {
    return(@sofar,
        $tree,
        $self->_treeIndex($tree->{'a'}, @sofar),
        $self->_treeIndex($tree->{'b'}, @sofar)
       );
} else {
    return(@sofar, $tree);
  }
} 

And naturally it is recursive as well. Maybe this article should have been about recursion instead, but we were confident that you already knew all about recursion.

Finally, we take the top n organisms and mate them with a random other organism from the list. Each mating involves taking a tree index of each partner, selecting a random point in each, and grafting the end of one list on to the beginning of the other. This is known as a crossover permutation, which is similar to the genetic scrambling that occurs in sexual reproduction.

The mate() method also uses _treeIndex():

sub mate {
  my ($self, $partner) = @_;
  my $self_clone = dclone $self;
  
  # Get part of a node from $partner and stick it
  # somewhere in $self_clone
  my @clone_index = $self_clone->_treeIndex;
  my @partner_index = $partner->_treeIndex;
  
  %{ $clone_index[rand @clone_index] } = 
  %{ dclone $partner_index[rand @partner_index] };
  
  $self_clone->clear_cache;
  return $self_clone;
} 

This is not so different from the mutation, except that we know that each half of the new tree previously belonged to some relatively successful individual, so the chances are higher that the result will also be fit than if we did a random mutation. Note that we use dclone() from the Storable module to create deep copies of complex structures rather than write our own cloning routine. This is because we are lazy.

The Terrifying Results

When all of this is plugged into a framework that iterates over generations, tests results, and dumps the results on screen, we discover that there is an awful lot of incomprehensible output. This output needs to be examined by hand to determine what's going on, although I'm sure that someone out there is willing to donate routines to assist in the analysis. One run we did for 1000 generations came up with the following results.

By generation 50 the GA had stumbled on a 12 node equation that simplified to 3x3, which gave it a mighty good fit over the complete interval, though this is deceptive as the method we use for testing fitness is inordinately happy with fitting the ends where the numbers are huge. A better fitness test would weight the targets evenly so that the fit in the middle would be better. Still, this gave us a substantially complete fit and showed that the GA could establish the order of the target polynomial very rapidly.

By generation 200 the GA had found the second term - the results reduced to 3x3 + 2x2. Again, given the fitness calculation it seemed unlikely that it would discover any further refinements of interest as the 'value' of such refinements was so small. This outlines the importance of selecting a fitness function that reflects your needs! Our fitness function basically said that we are interested primarily in matching the broad strokes of the curve without regard for the details, which is what we got.

By termination the GA had started fiddling with sines and natural logs in order to better fit the middle region of the graph. This dead end caused it to actually diverge from the real curve outside of the tested interval, while improving the fit inside - while it better met our stated criteria, it was actually taking us further afield. Again, it shows how important it is to tell computers what you want rather than make them guess.

xpolynomialgen 50gen 200gen 1000
-12*-4891-5184-4896-4883.81
-11*-3747-3993-3751-3743.32
-10*-2797-3000-2800-2795.29
- 9*-2023-2187-2025-2023.29
- 8-1407-1536-1408-1408.00
- 7- 931-1029- 931- 932.04
- 6- 577- 648- 576- 577.52
- 5- 327- 375- 325- 326.59
- 4- 163- 192- 160- 161.36
- 3- 67- 81- 63- 63.95
- 2- 21- 24- 16- 16.51
- 1- 7- 3- 1- 1.15
0- 7000
1- 3354.81
223243231.15
389819996.9
4213192224219.94
5413375425418.12
6707648720709.33
71113102911271111.44
81649153616641642.32
9*2333218723492319.85
10*3183300032003161.89
11*4217399342354186.33
12*5453518454725411.03
*outside the range used in calculating fitness

Overall, the convergence to our stated criteria was very rapid and effective, as we had very solid fitness levels and the order of the polynomial correctly established by generation 50.

Other Applications

This GA engine could easily be extended to do different work. Simply fitting functions to data is cute, but by altering the function set, the terminal set, and the evaluation process, the GA can be used to generate behavioral algorithms for robots, control system solutions, and so on. Not that it's going to be easy: your evaluation routines will become very complex for non-mathematical projects, but it's still feasible. Examples from Koza's Genetic Programming include:

The terminal set is MoveRandomly, MoveToNest, PickUp, and DropPheromone.

The function set contains custom routines IfFoodHere(), IfCarryingFood(), MoveToAdjacentFoodElse(), MoveToAdjacentPheromoneElse(), and Progn (which executes a list of instructions - we think we can just drop it and eval everything, but we haven't tried. You tell us.)

It's not instantly clear how to do this, but it should be possible. We look forward to hearing the results of your own experiments on the Perl-AI mailing list!

The terminal set is just a constant from -10.000 to 10.000.

The function set contains XSIN(), XCOS(), +, -, *, and %. XSIN and XCOS are two-argument functions taking the coefficient and the harmonic (truncated to an integer).

There are plenty more, including ones that develop neural nets and other sophisticated decision-making or controlling algorithms. We recommend you pick up Koza's book to investigate further.

Going Further

As research in genetic programming has advanced, some new mechanisms have been considered. One of the more interetsing ones is 'islanding'. This involves running more than one population in isolation from each other, and occasionally migrating some small number of the fittest individuals from one 'island' to another.

When you first run your GA, you will probably notice that the speed of convergence and the quality of the result depends quite heavily on the initial set of organisms and the nature of the first few mutations and matings. This is probably a flaw in our engine somewhere, but if it is fundamental to GAs, then islanding provides a way to occasionally introduce 'ideas' to each system that are both good and novel. In theory this should increase the fitness of each population.

You have probably also noticed that we have built the GA so that it can be distributed - because the algebraic expressions are implemented as strings to be evaluated rather than as references to subroutines, expensive runs can be designed so that the organisms are serialized and shipped to sepearate processors for the fitness runs. We haven't done this yet, so we're looking forward to either your results or your donations of hardware to facilitate the project. A farm of Sparcs would be fine.

Other Fitness Functions

For the purposes of this article, we thought it best to use a relatively naοve fitness calculation (just add the error at all the data points), rather than spend time talking about how to come up with a better function. If you experiment with various other methods for calculating the fitness, you may achieve better results or faster convergence. Let us know how it works out!

Resources

__END__


Ken Williams (ken@forum.swarthmore.edu) lives in St. Paul, MN in an apartment overrun by Organisms. He is the author of a few Apache::modules and some other fun things on CPAN. He recently left the world of Perl consulting to become a full-time high school math teacher. Hide your kids!

Brad Murray is a software analyst for Alcatel Canada, where he is allowed to play with trains, networks, and occasionally Perl. He needs an unusually large bed.


PREVIOUS  TABLE OF CONTENTS  NEXT