Regexes: Nibbling with pos() - The Perl Journal, Fall 1997

Jeffrey Friedl

Several months ago I started working at Yahoo!, where I apply my text-processing enthusiasm to financial information and news feeds. You can see the result at http://quote.yahoo.com. It's fertile ground for Perl to flex its muscle, but I recently came across a problem that had me rather stumped until the oft-ignored pos() came to the rescue. In this article, we'll take a look at the problem and at a few different tactics I used trying to solve it. I hope it'll provide some interesting techniques to help you with similar problems.

The Problem

As Yahoo receives articles from various news services, we'd like to link them to the news page for each company mentioned in the article. Sometimes the news services encode information about which companies are referenced, and sometimes not. For these articles, I proposed that we scan the articles for company names.

That's a lot easier said than done. Considering that we maintain news on over 15,000 companies, think how you might go about identifying the companies mentioned in any particular article. Cycling through each company name to see if it's present is simple, but would take forever. On another hand, one huge /Yahoo|Intel|Adaptec|General Motors|.../ regex to match all company names would also take way too long to run, assuming Perl even allowed such a huge regex in the first place.

One Approach

Those familiar with the different styles of regex engines will recognize that a huge /this|that|other|.../ set of alternates will generally run slowly in Perl, but faster with a tool like lex or flex (which use a potentially time-consuming pre-processing stage that ends up reaping huge run-time wins with regexes like this). (For more information about how Perl's regex engine works, see TPJ #2, or my book Mastering Regular Expressions published by O'Reilly & Associates) Unfortunately, it turned out that the enormous regular expression required was too large even for flex. I needed a different approach.

A Different Approach

The main obstacle so far is that the size of the problem - the set of all company names - is so huge that we can't use any of the normal tools off the shelf. If we can reduce this, we can return to using those powerful and familiar approaches.

One way to reduce the size of the problem is to consider only those company names that begin with a particular word. Once we have the word "Yahoo," we then have to consider only "Yahoo Inc," which is a simple task. On the other hand, if we have a more common word like "America" we'll have to consider names such as "American Online," "America West Airlines," and "America First Participating Preferred Equity Mortgage Fund," among many others. Still, it's a manageable.

To prepare, I created two databases using tied hashes. One maps all of the first words in company names to a list of possible stock tickers (symbols used to trade the stock, such as YHOO for Yahoo) designating companies whose name begins with the word. The other database maps stock tickers to the full company name. Then, when I come across a word that might start a company name, I use the list of stock tickers to match the full name, reporting the stock tickers of the matching companies. We link news via stock ticker, so ultimately I need to generate a list of stock tickers.

Now, how to actually go about the processing? I first tried a "nibble" approach that I've had luck with before. I put the entire article into a string and nibbled off items from the front, processing each as I liked:

$_ = $entire_article; 
while (length $_) { 
    ## try to grab a leading capitalized word... 
    if (s/^([A-Z]\w*)//) { 		
        my $word = $1; 	
        ## do something with word ... 	 
    } 
    ## get rid of any leading "else" ... 
    s{^\w* # a non-capitalized word, if any 
       \W+ # non-word stuff 
    }{}x; 
}

where the "do something with word" involves checking to see if there's a company beginning with $word, and if so, seeing whether the full name matches at that point. Since $word is stripped from the beginning of $_, we want to see if the rest of the company name can be found there. Omitting the first word of the company name (rather than keeping the full name) keeps the database a bit smaller.(The second substitution in the snippet uses the s/.../.../x form of Perl’s substitution operator, which allows comments and free whitespace (outside of character classes). That substitution is equivalent to s/^\w*\W+//. Readers unfamiliar with the /x modifier and selectable delimiters should see Chapter 2 of Programming Perl, or my book (pages 230 and 247).)

How can we search for the company name? We can't use a simple string comparison with eq because the words might be broken by some amount of whitespace - we'd have to search for /Yahoo\s+Inc\b/ instead of the fixed string "Yahoo Inc." The trailing /\b/, a word boundary, is important since it ensures that the "Inc" is not embedded within some other word. This allows "Yahoo Inc." to match, but prohibits "Yahoo Incubators."

One approach could be to pre-process $entire_article to normalize all spans of whitespace to a single space. This saves having to convert the name to a regular expression (and the overhead of then applying it), but it requires an extra overall search and replace on $entire_article, and removes some of the flexibility of using regular expressions (such as the convenience of /\b/, or recognizing "Inc" and "Incorporated" as equivalent with a regex that matches both, like /Inc(?:\b|Incorporated)/).

Unfortunately, both methods ended up taking several seconds per article - too slow for Yahoo!. I tried a number of tricks to gain some speed. I changed the ticker-to-name database to a ticker-to-regex database, where the regex matched the rest of the name. It's not always a straightforward task to turn a fixed string into a regex when you want to use a trailing /\b/, since something like /Yahoo!\b/ would not produce the expected results (it could match "Yahoo!oohaY" but not "Yahoo! Inc"), so some caution is required. Accomplishing this in a pre-processing stage, before any articles are processed, allows me to compute without worrying about saving cycles. Some of the data I ended up with is shown below:

A sampling of %word2ticker database entries:

    Acme          ACE|AMI|ACU 
    International ...|IBM|...|IGT|... 
    Adaptec       ADPT

A sampling of %ticker2restofname database entries:

    ADPT Inc(?:\b|orporated\b) 	
    ACE Electric\s+Corp(?:\b|oration\b) 	
    ACU United\s+Corp(?:\b|oration\b) 	
    AMI Metals\s+Inc(?:\b|orporated\b) 	
    IBM Business\s+Machines\s+Corp(?:\b|oration\b)
    IGT Game\s+Technology\s+Inc(?:\b|orporated\b)

When the script finds the word "Adaptec," it looks in %word2ticker to see if it might be the start of the company name for ADPT. The other database, %ticker2restofname, then tells us that if we can match /Inc(?:\b|orporated\b)/ we'll have found the company "Adaptec Incorporated."

You'll notice that each regex of %ticker2restofname begins directly with a word. This means that as our scripts nibble off words to be checked, it'll also have to remove any trailing whitespace. This allows the %ticker2restofname regexes to be applied directly. Putting this all together, the main body of the routine then looked like:

MAIN: 
  while (length) { 
     if (s{ ^ ([A-Z]\w*) \s* }{}x ) { # capitalized word
        if ($tickers = $word2ticker{$1}) { 

            # for each company beginning with this word...
            while ($tickers =~ m/([^|]+)/g) { 

               $regex = $ticker2restofname{$ticker = $1}; 
               if (s/^$regex//g) { 
                   ## mark the ticker as seen
                   $found{$ticker} = 1; 
                   next MAIN; 
               } 
            } 
        } 
     } 
     s/^\w*\W+//; # get rid of anything else 
  }

This works well enough, but was still slower than I wanted. One problem is the incessant modification of the string as it's nibbled away. I used the nibbling approach because it was convenient, not because of an inherent need to modify the article. I thought that it might be faster if I could dispense with all the substitutions.

Going On A Diet

Since the nibbling occurs at the beginning of the string, I had to try matching the full name only at the beginning - I could use /^/, which can speed up matching considerably. It's a useful technique, but in this case it's overkill. It was time to stop my nibbling.

Remember the difference between list and scalar contexts when matching globally (m/.../g). In a list context (for example, @matches = m/.../g), the regex is applied over and over, the results are returned as a list (a list of matches, or, if the regex contains capturing parentheses, a list of captured elements for each overall match) which is then assigned to @matches.

In a scalar context, such as while (m/.../g), the "next" occurrence in the target string is matched. Rather than all matches happening at once, one match attempt takes place each time the program executes the regex. The first time through, the first possible match is selected; subsequent iterations (until the target string is modified, or until there are no more matches) select the next match. Consider:

 while ( m/ ([A-Z]\w*) \s* /gx ) { 
     my $word = $1; 
     printf "found [$word] ending at %d\n", pos($_); 
 }

After a scalar context m/.../g is executed (and only then), the built-in pos() function returns the position at which the match ended. Looking at it another way, it's the offset at which the next match will be attempted. (These two viewpoints are almost always the same, but in certain cases where the regex can successfully match nothingness, the regex engine will actually start a match from one character beyond pos() to avoid an infinite loop.)

This pos() behavior applies to the string, not to the regex. (This wasn't the case until Perl 5.) So we can use additional scalar-context m/.../g matches on the same string to match something else where the first match (our word-grabber) left off. Consider the not-quite-correct code:

    	while ( m{ ([A-Z]\w*) \s* }gx ) { 	
        my $word = $1; 	 
        if (m{ \G Inc\b }gx) { 		
            print "Found ''$word Inc''.\n"; 	 
        } 
    }

We can use this approach to solve the company-matching problem more efficiently, but some caution is called for. First, all matches to be done "down the string" must be applied via scalar-context m/.../g operators, and it's easy to forget the /g in a scalar context.

Secondly, if we want a regex to match only at the position where the previous one left off, we have to take care to use /\G/ at the beginning. This prohibits the normal "bump along" down the string in search of a match.

Finally, the remembered position in the string is reset upon a failing match, so we need to save and restore it manually, via a variable that I'll call $offset.

 while ( m{ ([A-Z]\w*) \s* }gx ) { 
     my $word = $1; 
     my $offset = pos($_); # where next match starts 
     if (m{ \G Inc\b }gx) { 		
         print "Found ''$word Inc''.\n"; 
     } else { 
         pos($_) = $offset; # we failed, restore pos 
     } 
 }

Note that we don't restore $offset when the attempt to match the rest of the name (just "Inc" in this example) succeeds. When it does succeed, the position is moved beyond the end of the name - right where we want it so that we can start afresh to search for another name.

When I implemented the full match algorithm using this method, the result looked something like this:

 MAIN: 
   while ( m{ ([A-Z]\w*) \s* }gx ) { 
       	my $word = $1; 	
       if ($tickers = $word2ticker{$word}) 	{ 	 
           my $pos = pos; # next match start position

           # for each company beginning with this word
           while ($tickers =~ m/([^|]+)/g) { 
               	$regex = $ticker2restofname{$ticker = $1}; 		
               if (m/\G$regex/g) { 		 
                   $found{$ticker} = 1; 		 
                   next MAIN; 		
               } 
               pos = $pos; # we failed, so restore match 	 
           } 
       	} 
   }

The speedup over the previous implementation was substantial. However, there were still other ways to speed things up. For example, it's a simple test to see if the character at pos($_) matches the first character of the rest-of-name regex (which, we know, always starts with literal text). This could be as simple as

	   $char = ord(substr($_, pos, 1));

after we know there are company names that begin with $word, and

    	    next if ord($regex) != $char;

just inside the check-each-company loop. This is a quick check that short-circuits the application of most regexes, which saves both regex compile time and execution time.

Another way to speed things up is to help lead the main first-word regex to a quicker match. We know that it will fail at each character position until an uppercase letter is found. Since we care only that we isolate a capitalized word, we can add a leading /\W*/ and it won't hurt a thing. It effectively tells the regex engine, explicitly, that it can continue past any /\W/ characters on the way to matching /[A-Z]\w*/ and the rest.

If we add /[a-z0-9_]\w*/, we can bypass non-capitalized words as well. Thus, a leading /\W* (?: [a-z0-9_]\w*\W* )*/ allows the regex itself (rather than the bump-along-on-failure mechanics of the regex engine) to skip over non-interesting parts of the text. Since we take care of this ourselves, we can use a leading \G to have the regex fail a bit more quickly when no words are left.

Putting this all together, we get

  while (m{\G\W* 	 
              (?: 	
                   # At this point we have either 
                   # nothing or a "word." If not an 
                   # uppercase word, skip and try again. 		 
                   [a-z0-9_]\w*\W* 
               )* 	
              (\w+)\s* 	 
         }gx) 
  { 
      	my $word = $1; 
      	if ($tickers = $word2ticker{$word}) 	{ 	 
          my $pos = pos; # save next match start position 	 
          my $char = ord(substr($_, $pos, 1));

          # for each company that begins with this word...
         	 while ($tickers =~ m/([^|]+)/g) 	 { 		
              $regex = $ticker2restofname{$ticker = $1}; 		
              next if ord($regex) != $char;

              if (m/\G$regex/g) { 		 
                  $found{$ticker} = 1; 		 
                  next MAIN; 
              } 	
              pos = $pos; # we failed, so must restore match 
          }
     	} 
  }

Depending on how the routine is used, there are a few other enhancements I can think of. For example, if it's to be applied to a lot of text, it might make sense to cache the compiled rest-of-name regexes (as it is, they are recompiled each time used, which might be once, or many times per article - they could be cached by enclosing in an eval()'d anonymous subroutine the first time used). All in all, however, we have a much-improved solution at hand.

_ _END_ _

Jeffrey Friedl works for Yahoo!, Inc., and wrote O'Reilly's Mastering Regular Expressions
(See http://enterprise.ic.gc.ca/~jfriedl/regex/).

TABLE OF CONTENTS