
Infobots and Purl

Kevin Lenzo

   <JUM> Whenever I set it to not autoload images with
         Netscape 2.01, the whole program locks up. 
         Anyone know why?
<Irving> no
   <JUM> Does anyone know where I can get Netscape 2.0????
   <url> i think netscape 2.0 is at
         ftp://archive.netscape.com/archive/index.html
   <JUM> I am forever grateful, Url.
   <JUM> Url: Are you running ver 2.01 with success?
   <url> jum: bugger all, i dunno
   <JUM> OK.
   <JUM> Thanks, Url
   <url> de nada, jum

You've probably heard about the Turing test, or about the Loebner prize, or other contests that measure how much a program can act like a human - in the last issue of TPJ, there was an article on Chatbot::Eliza, a module that behaves like a Rogerian therapist. Instead of asking how intelligent a program can be, let's explore the usefulness of impersonating a human. Enter the infobot: an autonomous program that converses with users. Infobots are an ongoing experiment in how we can interact as communities with a common institutional memory. If that sounds too grandiose, think of it as a study in interactive graffiti.

IRC? WHAT'S THAT?

The infobots first appeared on the EFNet (Eris-Free Net) Internet Relay Chat (IRC) in June of 1995. On IRC, people talk to one another (typically, to entire groups of people) in channels, each devoted to a particular topic. When a user creates a channel, he or she becomes a channel operator, which confers special powers over other users.

Some channels are popular; the #macintosh channel has about 50 users every hour of the day. The #perl channel has 74 users as I write this. EFNet is the largest noncommercial chat network, with about 40,000 users from around the world at any given moment.

At Carnegie Mellon University, we have operated irc.cs.cmu.edu as an EFNet leaf node since 1996. I've been able to develop the infobots because I was the administrator of the machine; in general, EFNet doesn't appreciate bots, since they are often used for abuse. For instance, bots have been used to spam users with advertisements for porn sites or to take control of other people's channels.

WHOLE LOTTA BOT

One popular type of beneficial bot is the Eggdrop, developed by Robey Pointer. Eggdrops are designed for channel protection; they can be linked together to monitor people or programs that spam users. Unfortunately, they have also been subverted for the exact behavior they were meant to guard against. (Lest you get the wrong idea, most of IRC is about simple chatting, not teenage net.punks and techno-warfare, but there are vandals in every community.)

More benign bots exist as well. Some bots run interactive games, like chaosbot on #chaos or robbot on #riskybus. On #chaos, people join one of two teams and compete to guess items in a top-10 list; on #riskybus, the users wager with fake money in a Jeopardy-like game. Even the IRC server administrators use bots to monitor connections and activity.

Infobots are different. They exist to collect information and answer questions. I have been running infobots for over two years now, and they have continued to evolve with the help of their communities. The bots are missed when they're away - when mine crash and don't respawn, I'll get an immediate "Where's the bot?" when I sign on, even before anyone says hello.

SO WHAT?

Why are infobots so popular? Well, they converse in natural language, they serve as a community memory, and they learn. The initial motivation for infobots came from #macintosh, where the same tired questions were asked again and again. We realized that if the answers were recorded, being helpful would be less of a chore, because we wouldn't have to repeat ourselves all the time. Even if no one on the channel knew the answer, the infobot could reply. This is an act of hubris and laziness, and thus well-suited to Perl.

There are other bots on EFNet: aack on #unixhelp, which has a fixed set of about 150 facts, and Computer and UN, which provide canned answers to questions. They don't learn anything over time, nor do they take advantage of Perl's wonderful text processing. Let's look at some examples of people using one of my bots: url (pronounced "Earl").

<calamari> does anyone know where i can get 
           quicktime 3.0?
     <url> somebody said quicktime 3.0 was at
 http://www.apple.com/quicktime/information/index.html

The question here is a permutation of "Where is X?", or, more simply, "X?", to which url responds with a sentence of the form "X is Y." The surface forms of the replies ("somebody said...") are chosen at random to make url seem less mechanical.
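If you're curious how that randomization might look in code, here's a minimal sketch; the template list and the reply_for() helper are illustrations of the idea, not url's actual internals:

# Illustrative reply templates; the real bot's list is longer
# and stored differently.
my @templates = (
    "somebody said %s was at %s",
    "i think %s is at %s",
    "it has been said that %s is at %s",
    "rumour has it %s is %s",
);

sub reply_for {
    my ($key, $value) = @_;
    my $form = $templates[ rand @templates ];   # pick a surface form at random
    return sprintf $form, $key, $value;
}

print reply_for("quicktime 3.0",
    "http://www.apple.com/quicktime/information/index.html"), "\n";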

My bots don't learn in a deep Artificial Intelligence sense, but they do get more helpful over time. url, like a sponge, soaks up information that it hears on the channel.

ARE YOU SPONGEWORTHY?

Yes. Everyone is. Whereas other bots are strictly information stores loaded by a priesthood of factoid keepers, mine are egalitarian. Here's a demonstration of purl, the infobot that lives on the EFNet Perl channel, #perl, 24 hours a day.

<juice> i am not very happy
<juice> me?
<purl> you are, like, not very happy

purl has soaked up quite a few factoids, as has her older brother url. (It's interesting to note that people often refer to purl as 'she,' while url is almost invariably a 'he'.) She can learn from any declarative statement including a verb of being ('is', 'are', 'am', and so on). url is more constrained, and requires that the part of the sentence after the verb contain a recognizable URL. The 'http://' isn't required; both url and purl infer it when necessary.
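The URL filter amounts to a couple of patterns. Here's a guess at what it might look like - a hedged sketch with my own regexes, not the production code:

# Accept a factoid only when the right-hand side looks like a URL,
# and supply the scheme when someone leaves it off.
sub looks_like_url {
    my $rhs = shift;
    return 1 if $rhs =~ m{^(?:http|ftp|gopher)://}i;   # explicit scheme
    return 1 if $rhs =~ m{^(?:www|ftp)\.[\w.-]+}i;     # bare hostname
    return 0;
}

sub normalize_url {
    my $rhs = shift;
    $rhs = "ftp://$rhs"  if $rhs =~ m{^ftp\.}i;         # infer ftp://
    $rhs = "http://$rhs" if $rhs =~ m{^www\.}i;         # infer http://
    return $rhs;
}

print normalize_url("www.perl.com/CPAN/"), "\n";        # http://www.perl.com/CPAN/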

On #perl:

<oznoid> purl, status?
  <purl> Since Thu May 21 17:50:02 1998, there have been 
         327 modifications and 263 questions. I have been 
         awake for 23 hours, 49 minutes, 8 seconds this 
         session, and currently reference 32513 factoids.

On #macintosh:

<oznoid> url, status?
   <url> Since Thu May 21 17:50:03 1998, there have been 45
         modifications and 509 questions. I have been awake
         for 23 hours, 57 minutes, 54 seconds this session,
         and currently reference 46159 factoids.

url is also on the EFNet channels #macdev, #linuxos, #avara, and #distributed. There is also a version of url running on the Undernet; their databases were originally synchronized, but they are now learning and interacting independently. (The dates shown are when the current processes began, not when the bots began learning.) A relative of url and purl, called rurl, hangs out on #robogeeks. It's less an information repository than a personification of bad attitude.

Sometimes people think they're interacting with an actual human being; the excerpt at the beginning of this article is real. I've seen people come into a channel, ask a series of questions and get pointers to good information, and then thank the bot and say "Thank God someone was listening!" It's even been hit on.

USING IRC AND BOTS
To experience these bots in action, sign onto IRC and check them out. You'll need an IRC client, which you can download for free from the sites below. To join the Perl channel, type /join #perl.

http://www.irchelp.org/: The IRC Help home page, with more information about EFNet and IRC
http://www.cs.cmu.edu/~lenzo/infobot.html and http://www.infobot.org: Infobot home pages
http://www.valuserve.com/~robey/eggdrop/: The Eggdrop home page
http://www.networks.org/irc/computer.html: The Computer bot
http://networks.org/irc/: The UN bot

YOU CAN'T DO THAT IN PUBLIC!

This brings us to the question of what goes into the interaction. Why does it sometimes fool people, and what keeps it from making inappropriate comments? Admittedly, it picks up many irrelevant things and occasionally spits out a useless comment, but it doesn't happen often, and when it does it merely adds to the charm.

People on #macintosh endured the early development cycle with great aplomb, providing a lot of feedback. I hear from several channels now, but #macintosh and, more recently, #perl still supply the most. Here's an example of a discussion on #macintosh about how the bot should behave. (Just to clarify, I go by the nickname oznoid, or oz for short.)

<tonyola> oz - a suggestion - url should only accept queries
          beginning with "url,...."
<golgo13> tonyola: No.
          * Zebe gives bizzy a botsnack
  <bizzy> i hate having to type "url"
  <bizzy> :)
 <durkin> oznoid, do you plan to make his database of facts
          available. i mean, i think his large knowledge is
          one of the most coveted features..
  <barry> give him a preferences window :)
<golgo13> Half of url's usefulness is answering questions from
          people who don't know to address him.
  <barry> url, preferences..then you set what ya like :)
<tonyola> but url keeps breaking in when he's not wanted
          otherwise
 <Fatale> is that really that much of an inconvenience,
          compared to his usefulness?
  <_elle> url is allowed to interrupt anytime :)
 <Fatale> i often wonder why it's such a problem, when you
          have free access to all that information
  <bizzy> heh
<golgo13> It is ridiculous for someone to come in and say
          "Where is stuffit expander?" and then for a regular
          to go "url: Stuffit expander?"
 <durkin> url, nslookup fall.back.host.net
    <url> fall.back.host.net is 205.166.250.10

Here are a few design decisions I've made:

<picklehed> purl, where can i find a blue petunia?
     <purl> picklehed: bugger all, i dunno
>purl< perl?
[purl] well, perl is the Swiss Army Chainsaw of programming,
       available at http://www.perl.com/.
<longhair_> purl, tell johnpc about fish
     <purl> okay, longhair_, i did.
<humblepie> the perl journal
     <purl> it has been said that the perl journal is at
            http://www.tpj.com/
<fimmtiu> purl, leave #perl
   <purl> goodbye, fimmtiu.
          *** purl has left channel #perl
<fimmtiu> Hey, purl.
   <purl> what's up, fimmtiu
<PerlJam> hi purl
   <purl> niihau, PerlJam

'Niihau' is 'hi' in Mandarin. The bots won't understand all greetings, however:

   <Murr> yo purl-boy
<fimmtiu> What's up, purl?
   <purl> up is, like, what's not down or the direction
          parallel and opposite the vector sum of all
          gravitational forces at a given point
 <JohnPC> re, purl
   <purl> re is short for "rehi", or "hello again". In Perl,
          it's an acronym for "regular expression".
 <oznoid> x?
   <purl> x is, like, my friend
 <oznoid> no, purl, x is y
   <purl> okay, oznoid.
 <oznoid> x is also z
   <purl> okay, oznoid.
 <oznoid> x?
   <purl> rumour has it x is y or z
<PerlJam> dammit purl, listen to oznoid when he's talking
          to you!
<PerlJam> ;-)
 <oznoid> purl, forget x
   <purl> oznoid: I forgot x

WHAT ARE YOU THINKING?

When someone enters a channel and asks a question, the bot will answer if it understands the question and knows something apropos. It parses the question with rules like these:

$finalQuestionMark = (s/\s*\?+$//);            # get the "?"
s/^\s*does any(one|body) know\s+//i;           # remove polite pre-query
if (s/^(who|what|where)\s+//i) {               # we've got a question
     $questionWord = $1;
}
if (s/^(is|are)\s+//i or s/\s+(is|are)$//i) {  # get the verb
     $verb = $1;
}

The bot then checks to see if it recognizes the remainder of the sentence. It does this by reducing the sentence to a canonical form: a standardized representation for the data that retains only the important information. In this case, that means no determiners ('a', 'an', or 'the'), and a few other simplifications. This covers queries such as:

Where is foo?
does anyone know where foo is?
what are foo
foo?

The infobots actually have a long list of verbs to check; the code shown above is for demonstration purposes only. This example only shows portions of the text being eliminated; in reality, the substitutions are used later to canonicalize the sentence for more advanced processing.
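To make the canonicalization step concrete, here's a small sketch under my own assumptions about which noise words get stripped; the real bot's list is longer:

# Reduce what's left of a sentence to a canonical hash key:
# lowercase, drop determiners and filled pauses, squeeze whitespace.
sub canonicalize {
    local $_ = lc shift;
    s/\s*\?+\s*$//;                        # trailing question marks
    s/\b(?:hmm+|umm*|uhh*|ahh*|err*)\b//g; # textual filled pauses
    s/\b(?:a|an|the)\s+//g;                # determiners
    s/^\s+//; s/\s+$//; s/\s+/ /g;         # normalize whitespace
    return $_;
}

print canonicalize("the Stuffit Expander?"), "\n";   # "stuffit expander"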

It was easier to do things this way than to have url parse a full grammar specified in Backus-Naur Format (BNF). One reason is that people define new words on the fly, so it's not possible to specify the complete grammar in advance without using a /.*/ placeholder to allow for new words. Furthermore, people don't speak in well-formed sentences very often, especially on IRC. There are even textual equivalents of filled pauses ('um', 'ah') that are quite communicative, but difficult to parse because they can appear anywhere and can be spelled in unusual ways.

Eating fragments of the input and storing them in variables allows the infobots to build up the contents of a semantic frame (a data structure containing attribute-value pairs) by eliminating the known fragments and judging the significance of the remainder. The approach is quite similar to the Phoenix grammars and semantic phrase parsers developed by Wayne Ward and others, which consume chunks of the input to build up a frame. Such parsers have been used quite successfully in speech recognition tasks like the ATIS (Airline Travel Information System) systems developed at CMU, MIT, and elsewhere; they continue to be part of our work at CMU.

At a high level, the processing goes like this:

  1. Is the sentence obviously a question?
    Yes: url canonicalizes the question and replies if it has an answer. Otherwise, url replies that it doesn't know, but only if it was directly addressed. Questions can also include mathematical expressions for evaluation or requests for information from third parties.
    No: If url recognizes the key (the left-hand-side X of a statement such as "X is Y", or the right hand side of a question like "Where is X?") in the database, it volunteers the factoid from the database if the key is long enough (e.g. more than 6 characters) or if url was directly addressed.
  2. Is the sentence an explicitly defined functional form?
    My bots understand commands like nslookup (which converts computer names to IP addresses) and internic (which performs an Internic "whois" query). If the sentence is one of these, url executes the external command and displays the result.
  3. Is the sentence parseable as a statement ("X is Y")?
    Yes: url checks to see whether X is already in the database. If so, it looks to see whether the user was aware of that - if he said, "No, url, X is Y" or "X is also Y". In that case, url replies, "okay." Otherwise, it says, "...but X is Y..." if it was addressed directly. This prevents the bots from paying attention to O-dogg below:
       <orwant> bless is a function that tells a reference
                that it's now an object.
       <O-dogg> bless is cool.
    
    No: url admits confusion, but only if it was directly addressed.
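Strung together, the whole dispatch looks roughly like the sketch below. The helper names (lookup(), learn(), run_command()) and the in-memory hash are inventions for illustration; the real code is considerably messier and handles many more verbs, surface forms, and edge cases.

my %db;                                     # stand-in for the tied factoid database

sub lookup { return $db{ lc $_[0] } }

sub learn {
    my ($nick, $key, $value, $override) = @_;
    my $old = $db{ lc $key };
    return "...but $key is $old..." if defined $old and not $override;
    $db{ lc $key } = $value;
    return "okay, $nick.";
}

sub run_command {                           # the real bot is far more careful here
    my ($cmd, $arg) = @_;
    return scalar `$cmd $arg`;
}

sub process {
    my ($nick, $text, $addressed) = @_;

    # 1. An obvious question: "where is X?", "what are X", "X?"
    if (my ($key) = $text =~ /^(?:who|what|where)\s+(?:is|are)\s+(.+?)\s*\??$/i) {
        my $answer = lookup($key);
        return "$key is $answer"            if defined $answer;
        return "$nick: bugger all, i dunno" if $addressed;
        return;                             # stay quiet otherwise
    }

    # 2. An explicitly defined functional form, e.g. "nslookup host"
    if (my ($host) = $text =~ /^nslookup\s+(\S+)$/i) {
        return run_command("nslookup", $host);
    }

    # 3. A statement: "X is Y", "no, X is Y", "X is also Y"
    if (my ($no, $key, $also, $value) =
            $text =~ /^(no,?\s+)?(.+?)\s+is\s+(also\s+)?(.+)$/i) {
        return learn($nick, $key, $value, defined $no || defined $also);
    }

    return "$nick: i'm not sure what you mean" if $addressed;
    return;
}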

As you can see, url can act very differently depending on whether it was addressed directly. That happens when a user calls url by name, or sends a private message. It replies in kind:

<paraquat> purl, do you have a life?
    <purl> i haven't a clue, paraquat

A slightly more involved interaction:

<paraquat> purl, PDL is at
           http://www.perl.org/CPAN/modules/by-module/PDL/
    <purl> ...but pdl is Perl Data Language...
<paraquat> purl, PDL is also at
           http://www.perl.org/CPAN/modules/by-module/PDL/
    <purl> okay, paraquat.
<paraquat> pdl?
    <purl> pdl is probably Perl Data Language or at
           http://www.perl.org/CPAN/modules/by-module/PDL/

You can see that it's not always as fluid as one would like, but it gets better through use. When an infobot annoys the channel, I hear about it pretty quickly.

IDIOT SAVANT

The infobots grow every day, and some of them have over 40,000 factoids. They have to be able to access their information quickly, and checking every item in the database against the input would take too long. The infobots use tie() and DB_File, so that the database appears as a pair of large hashes that actually reside on disk and can be looked up without requiring lots of RAM. Of course, that means whatever you're looking for has to match the hash key exactly. That's why the input must be canonicalized, stripped of irrelevant utterances like 'hmm...', 'a', 'an', and 'the', among others. This canonicalization is performed both when the item is stored and when the bot tries to answer a query. This doesn't always work quite the way you'd like, and so you occasionally get Rainmanesque responses: "aaaah... i know something about X... aaah... That would be Y, yeah".

Since the entire database resides on disk, and not in memory, the run-time memory requirements are reasonable - on the order of 5.5 megabytes for url, the largest bot. Because no explicit database search is required, the bot's responses are nearly instantaneous, which is important for the real-time communication on IRC. Currently, we use one database for singular entries and another for plurals, so it's quite easy to maintain subject-verb agreement even when the query has no verb. I'm merging them into a single, unified database in which each entry has an attribute for each case.
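In code, the storage end might look something like this; the filenames and the exact split into 'is' and 'are' hashes are my guesses at the details rather than the bot's actual configuration:

use DB_File;
use Fcntl;

# Tie the factoid databases to on-disk hashes; a lookup is then just a
# hash access on the canonicalized key, so no search is ever needed.
my (%is, %are);
tie %is,  'DB_File', 'factoids-is.db',  O_CREAT|O_RDWR, 0644, $DB_HASH
    or die "can't tie is-db: $!";
tie %are, 'DB_File', 'factoids-are.db', O_CREAT|O_RDWR, 0644, $DB_HASH
    or die "can't tie are-db: $!";

# The same canonicalization used when storing is applied at query time;
# lc() stands in for it in this sketch.
my $key = lc "perl journal";
print "it has been said that $key is $is{$key}\n" if exists $is{$key};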

A SENSE OF PLAY

url has an elder brother named hocus, who lacks the frontend filters that restrict knowledge to URLs. hocus soaked up anything that looked declarative - anything at all. When the first infobots appeared without this restraint, people quickly became annoyed by the frivolous responses. purl is actually quite hocus-like, but with more stop expressions. Those reject some queries and prune others.
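A stop list in that spirit can be as simple as a handful of patterns; these particular ones are made up to show the idea:

# Illustrative stop expressions: keys the bot should never learn
# or volunteer answers for.
my @stop = (
    '^(who|what|where|when|why|how)$',      # bare question words
    '^(it|that|this|he|she|they|you|i)$',   # pronouns with no referent
    '^\s*$',                                # empty after canonicalization
);

sub stopped {
    my $key = shift;
    foreach my $pat (@stop) {
        return 1 if $key =~ /$pat/i;
    }
    return 0;
}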

Once people realize the bot is a bot, they often have a little fun with it, stuffing it with questionable information. The bots aren't shy about expressing their newfound ideas in public, sometimes to the annoyance of channel curmudgeons.

Eventually, those who find the bot annoying at first usually begin to appreciate its usefulness. Still, the play continues, sometimes sparked by a random bot comment and leading to the definition of a few humorous factoids.

  <hindeur> purl, hinder?
     <purl> hmmm... hinder is at http://www.ubertone.ml.org/
<q[merlyn]> you need to say "is also"
 <rootbeer> When debugging regexes, the $& and related
            variables can be quite informative.
<q[merlyn]> otherwise it won't overwrite a meaning
  <hindeur> purl, hinder is also amagosa's friend
     <purl> okay, hindeur.
<q[merlyn]> there
<q[merlyn]> like that
<q[merlyn]> purl is also a miss take!
  <hindeur> this bot is fun!
     <purl> okay, q[merlyn].
<q[merlyn]> purl?
     <purl> yes, q[merlyn]?
<q[merlyn]> purl, purl?
     <purl> it has been said that i am the bot that's got
            the info that's hot and a lot that's not and all
            of that rot or a knut or buggy sometimes. or a
            miss take!
     <Taer> sorry. wife stole camel 2ed.
            * nickm gets funky with purl
<q[merlyn]> purl is also funky

...time passes...

    <nickm> purl is a commie pinko sympathizer.
     <purl> ...but purl is the bot that's got the info that's
            hot and a lot that's not and all of that rot or a
            knut or buggy sometimes. or a miss take! or funky
            or a bit shy or a sweater! or a really really cool
            bot....

WHAT? WE ORDERED NO PIZZAS!

Bots are a public resource, and as such are susceptible to vandalism. Some people try to break the bots. That's not a bad thing if they're just being playful and testing the limits, but sometimes it's an act of malice. For instance, since any user can change or delete entries, we occasionally find useful nuggets of information replaced by phrases like "Microsoft sucks" or "Apple sucks". At least the damage is visible, and rapidly fixed by the channel.

At one point, I allowed people to get a random entry from the database. Soon someone was steadily lobotomizing the bot, by retrieving random entries and having the bot forget them, one by one. Since I maintain logs of the bot's interactions, I was able to back out all the changes from that user, but the experience was unpleasant.

The math-handling abilities of the bot use eval(), which can be extremely dangerous. Early experience taught me to check the input very carefully before evaluating it - this is why it won't handle all expressions. Even the Safe module isn't adequate protection; the documentation enumerates some of the side effects, such as infinite loops and memory hogging, that certain inputs might cause. I simply reject anything outside of a very limited subset of functions. Even with this precaution, users found and exploited a bug in the system libraries that made the bot crash.
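The spirit of the check is to refuse anything that isn't obviously arithmetic before it ever reaches eval(). The sketch below is my approximation of that idea, not the bot's actual filter, and even this wouldn't stop every pathological expression:

# Only evaluate strings made of numbers, basic operators, parentheses,
# and a tiny whitelist of functions; refuse everything else outright.
sub safe_math {
    my $expr = shift;
    return undef if length($expr) > 80;     # keep expressions short
    return undef unless $expr =~
        m{^[\s\d.+\-*/%()]*(?:\b(?:sqrt|int|abs|log|exp)\b[\s\d.+\-*/%()]*)*$};
    my $result = eval $expr;                # vetted text only
    return $@ ? undef : $result;
}

print safe_math("2 ** 10 + sqrt(16)"), "\n";                      # 1028
print defined safe_math('`rm -rf /`') ? "oops" : "rejected", "\n"; # rejected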

The channels get quite defensive of the bots when they see people trying to vandalize them, or even when people are merely rude. This is prevalent on #macintosh, where url has been a fixture for a couple of years.

FUTURE DIRECTIONS

The infobot code is undergoing substantial revisions; for one thing, it is being modularized. Since there are now versions that work standalone on the desktop, with IRC, or with Zephyr (another messaging system), the infobot code is being decoupled from the communications protocol. Net::IRC, a module designed specifically for manipulating the Internet Relay Chat protocol, will be integrated into the infobot code. Some of us have also been talking about connecting several bots together, expanding the types of statements and questions, implementing per-channel 'personalities', and connecting networks of infobots - each an expert on its own topic. We have settled on the Bazaar model rather than the Cathedral model of software development, as described in Eric S. Raymond's "The Cathedral and the Bazaar." We want as many people getting the source as possible, making interesting modifications to it, and giving it back to the community. Just like Perl itself.

WHERE TO GET IT

The code is a mess right now. Fortunately we have a group of people and a mailing list, and we're redesigning it from scratch. By the time this article is published, we are hoping you won't have to use the collection of barnacles that comprise the current infobot. The source has been available for some time now, warts and all - I had to get over my desire to wait until it was perfect. It's far from perfect now, but it has been improving quickly since the public release. To get on the mailing list, send me mail.

THANKS

People who have been running infobots have been a great resource, particularly Adam T. Lindley (abstract on IRC), who has done extensive work on the script and set up the web site at www.infobot.org. Patrick Cole (wildfire on IRC) helped replace the IRC client code and the user profiling system. Dennis Taylor (fimmtiu), who is also involved with Net::IRC, was also a great source of ideas and inspiration.

All my work is due, in one way or another, to a love of Perl itself. Perl's philosophy appeals to me more than any other programming medium. Thanks to p5p and all of the people working on Perl. We owe ya, bigtime.

REFERENCES

Raymond, Eric S. "The Cathedral and the Bazaar." http://earthspace.net/~esr/writings/cathedral-bazaar/cathedral-bazaar.html

Ward, Wayne. "Understanding spontaneous speech: the Phoenix system." Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1991, vol. 1, pp. 365-367.

Seneff, Stephanie. "Robust parsing for spoken language systems." Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, 1992, vol. 1, pp. 189-192.

Nolan, John. "Chatbot::Eliza." The Perl Journal, Issue 9, pp. 16-18.

__END__


Kevin Lenzo (lenzo@cs.cmu.edu, www.cs.cmu.edu/~lenzo) is a bass player, IRC admin, Perl junkie, and Robotics graduate student at Carnegie Mellon University specializing in speech technology. He may graduate again by the next millennium.
