PREVIOUS  TABLE OF CONTENTS  NEXT 

Understanding Regular Expressions

Jeffrey Friedl

Maybe I'm somewhat of a kook, but I find a lot of pleasure (and learning) in taking some simple task and really investigating all the ways to go about solving it, comparing and contrasting the various solutions.

Usually the end result is only a better understanding of Perl (and often, Perl's regular expressions), but sometimes there really is a tangible benefit. For example, the Perl FAQ about removing whitespace from strings is the result of a long day of benchmarking.

Not long ago in comp.lang.perl.misc, someone asked if there was an "and" for regular expressions comparable to the "or" in /this|that/. That is, he wanted to find lines that matched two otherwise unrelated expressions.

The quick answer provided by many was an && between two regular expressions: /this/ && /that/, although some offered solutions such as /this.*that|that.*this/.

Randal Schwartz came up with the silly but ingenious /^(?=.*one)(?=.*two)/, and when Tom Christiansen asked why the ^ was included, I started to get the itch to really go at this example.

Knowing vs. Knowing On Paper

Japanese has an expression that translates to English as "paper driver." These people have driver's licenses (hard to get in Japan) but don't have a car and hardly ever drive - they're drivers only on paper. I tend to think twice before getting in the car when they're behind the wheel. Textbook knowledge without experience to back it up doesn't mean much.

In my previous column, I gave the story of Fred (Full Regular Expression Description) that hopefully helped remove the shroud of mystery about regexes. But knowing how the regex engine works is only the first step - it takes experience to turn that into knowing the wider-perspective effects. The aforementioned question related to /this/ && /that/ is basically a simple one, so without the details of the problem getting in the way, I'd like to look at a host of possible solutions. We'll use our knowledge of how regexes actually work to derive points for comparison.

Cooks And Broth

The basic goal is: Given some text, determine whether it can match two unrelated expressions. I've assembled a list of various ways one might go about this, using both my own ideas and those from members of the studio audience (such as Randal's offering). The table below lists them in no particular order, using /one/ and /two/ as the sample regexes.

group contestant number solution
1 1 /one/ && /two/
2 2 /one.*two|two.*one/
3 /one.*?two|two.*?one/
3 4 /one.*two/ || /two.*one/
5 /one.*?two/ || /two.*?one/
4 6 /^(?=.*one)(?=.*two)/
7 /(?=.*one)(?=.*two)/
8 /^(?=.*?one)(?=.*?two)/
9 /(?=.*?one)(?=.*?two)/

The first thing to do is understand if and why they work, and what practical differences they have with respect to exactly what, where, and when they will match. Once understood, we can move on to looking at their relative efficiency.

Will They Work At All?

Contestant #1 is pretty simple to understand, and is the most direct rendition of what was asked. The Group 2 expressions take quite a different approach, noting that either /one/ follows /two/ (that is, /two.*one/) or the other way around. The two contestants of the group differ only in that one uses the greedy .* while the other the non-greedy .*?. They're different, but will always succeed and fail on the same texts.

The Group 3 expressions are comparable to the Group 2 ones, but use the non-regex "or" instead of the regex one. The basic premise of "they're either in this order or that" still holds.

Finally, the Group 4 expressions are all variations on a theme. Because parts of each regex are enclosed in "non-consuming" (?=...) parentheses, Perl will insist that they match, but won't consider them part of the match results ($&). The regexes will simply return true if /one/ matches somewhere (i.e. after /.*/), and /two/ matches somewhere as well. The differing /^/ merely indicates the logical conclusion that if the rest of the expression can't match right away, it won't match later on either (since the /.*/ or /.*?/ run off to the end of the line).

Of course, there are plenty of other regexes, including hybrids like /one(?=.*two)/ || /two(?=.*?one)/ but I'll try to contain my excitement to the nine shown in the chart.

How Do They Differ?

Obviously, there's a large difference among them as to how they affect $& and other side effects. It seemed that the comp.lang.perl.misc fellow merely wanted a true/false answer, so these differences aren't important to us. But the fact that /one/ && /two/ leaves $& in quite a different state than /one.*two/ || /two.*one/ could prove quite important, so it's good to at least mention the issue before dismissing it (as I intend to do for the rest of the article).

However, more to the point, there are differences in which will match what, and two examples should make it clear: one is the text "one\ntwo" and the other "twone".

The first issue (which might never matter unless you work with multiline text) has to do with dot not normally matching a newline. This allows Contestant #1 to match text that the others can't. The /s modifier removes the issue, as it causes dot to match any byte including newline. The issue is even more touchy among the solutions in Group 4. Text such as "blah blah blah\n one and two" could be matched by contestants #7 and #9, but not #6 and #8 (unless, of course, the /s or /m modifiers were used). Think about it for a moment if it's not clear. Tom made a very valid point when he brought up /^/, although for the common case of processing lines, the dot-doesn't-match-newline issue is irrelevant.

The second issue is much more important. Contestant #1 and the Group 4 contestants will match if each regex "can match the target text, period." With the Group 2 and 3 solutions, the serial ordering within the regex requires that the two matches not overlap. It might be a very important distinction.

Which set of semantics is correct? It depends entirely on the particular needs of the situation. If, for example, you are trying to match English words, your regexes should probably include a generous supply of /\b/'s, one result of which is that the differing semantics would be rendered irrelevant (consider that /\bone\b/ && /\btwo\b/ would never be able to match the previous bone of contention, "twone" ). In any case, again, so long as you're cognizant of the issues involved, you can make the appropriate decisions on a case-by-case basis.

Which Is Best?

So, which contestant is best? What exactly is behind door #2? Beauty is in the eye of the beholder, and which is "best" much depends on your criteria for judging. The differing semantics aside, /one/ && /two/ definitely wins the "simplistic beauty" award, while Randal's /^(?=.*one)(?=.*two)/ might go well toward impressing your friends (and, perhaps, judges of TPJ's Obfuscated Perl Programming Contest).

Depending on the need, efficiency may well be important too, and this is where our detailed knowledge of how the regex engine works can really pay off.

Efficiency

In some cases, efficiency is a black/white issue that's easy to analyze: /[abc]/ is much more efficient than /(a|b|c)/, period. But in many cases it's not so clear, involving more than just the regex engine. The rest of this article considers the efficiencies of our nine expressions. Often, because of the many variables involved (particularly, the differing kinds of target text), the discussion becomes necessarily vague. At the end I'll tie things together with some benchmarking, at which point a re-read might help the vague points sink in.

Governing and Following: /.*/ vs /.*?/

So first, let's compare and contrast the .* vs. .*? aspect. If you remember from the previous column, the only difference between * and *? is that *? gives priority to what follows over what it governs, as opposed to giving priority to what it governs. Keeping this in mind while considering /one.*two/ and /one.*?two/, here's a quick quiz: given the text

one two blah blah blah blah blah blah blah

which of the two would find the match more quickly, and why? Simply following in your mind what you know the engine must do will quickly reveal the answer. After matching /one/, the first's /.*/ will blindly zip to the end of the string, at which point it will try to apply /two/ and fail. It will then backtrack once to retry and fail again. In the end, it will have to backtrack almost all the way back before /two/ can match.

Conversely, consider /one.*?two/. Before the *? even attempts the first /./, it will try the /two/ at " tw". It will fail, backtrack and immediately apply dot to match the space after one, at which point /two/ is attempted again and matches. Thus, the *? version matches this data much more quickly than the * version.

What does this tell us about /one.*two/ and /one.*?two/? Frankly, nothing, because the tables are completely turned with the data

one blah blah blah blah blah blah blah two

where the * version out-shines the *? version almost exactly as much as *? outshined * with the first example text. And, of course, I hope Fred's woes in the previous column reminds you to consider a line such as

one blah blah blah blah blah blah blah

that won't match. Which of our two samples will have to backtrack less before it is able to fail? Again, an understanding of the engine helps. Both versions attempt to look at exactly the same matches - the difference is simply in the order checked, so both will have to exhaustively work exactly the same amount before admitting failure.

All this is, of course, "in theory." In practice, .* seems to be optimized quite a bit more than .*? as some benchmarks will soon reveal.

Logical "or", Regex "or": /.../ || /.../ vs /...|.../

Let's leave * vs. *? for a moment and look at the differences between /one.*two|two.*one/ and /one.*two/ || /two.*one/.

They might well appear similar, but are extremely different. Although in this article we're not concentrating on what text is actually matched, it may be instructive to note that with the text

A network's number one priority is accuracy, two is throughput.

one of these expressions will match one way, the other another. Because /one.*two|two.*one/ is one expression, it as a whole is attempted at each successive character position, while the other as two separate regex matches applies /one.*two/ completely before even bothering to apply /two.*one/ (and will only do so if necessary). It might be a good exercise to consider how the other expressions compare.

As for these two expressions' efficiency, we get into the fuzzy area of extra-expression optimizations. Usually, if the engine can decide that some fixed text (particularly leading fixed text) is required for a match, it can use a faster fixed-string search to rule out - quickly - sections of text that couldn't possibly match. With each part of /one.*two/ || /two.*one/, it's relatively simple for the regex engine to realize that matches must start with one and two respectively, and so can use these optimizations to speed things up. Furthermore, in these 'fixed text known' cases, study() can help speed things up even more.

Now consider the expression /one.*two|two.*one/. As the expression is being compiled, the engine can derive the 'fixed text required' for each alternative, but it requires more work to combine the results to realize that any matching line must have one and two in them somewhere. It's conceivable that Perl could eventually do this, but at this point it doesn't, so the result is that none of the extra-expression optimizations are done. Perhaps the amount of time needed to analyze these complicated situations is more than they might eventually save.

Understanding exactly when these optimizations will kick in involves some magic, but if your Perl was compiled with DEBUGGING, you can have it tell you what it's derived about your expression by using the -Dr option:

 
% perl5.003 -Dr -c -e '/one.*two/' 
rarest char o at 0 	
...
start 'one' minlen 6 

The "rarest char" is used for cases when the target text has been study'ed.

For /one.*two|two.*one/, Perl 5.003 doesn't even realize that any match must be at least six characters long:

-Dr reports "minlen 0".

Benchmarking

Figure 1: Contestant from the studio audience

Figure 1: Contestant from the studio audience

When it comes down to it, theory only goes so far - benchmarking representative data is where you get it straight from the horse's mouth. Figure 1 shows some benchmarking with the Group 2/3 regexes. The green numbers across the bottom are our contestant regex numbers from the table on page 32. The various lines represent differing data that I tested each expression with (the legend shows a representation of the kind of data - for example, the black diamond

[one two ---------]

indicates that the test lines had the matches early in the line). There's quite a lot of information here, much of which supports what we've talked about in theory. It may require some study to pick out the information. Here are some deductions:

The green asterisk shows very little work for #4/#5, and maximum work for #2/#3. With the former, the extra-regex optimizations probably short-circuit the regular expressions entirely.

To sum up: We can conclude that /one.*two|two.*one/ is not as good as /one.*two/ || /two.*one/ for a variety of reasons, even if we can find specific data for which the first works faster. (Quiz: can you think of test data that would take substantially longer, like an order of magnitude, with #4/#5 over #2/#3?)

Optimizations of /^(?=.*one)(?=.*two)/

Figure 2: Contestant from the studio audience

Figure 2: Contestant from the studio audience

Now I'd like to look at Randal's /^(?=.*one)(?=.*two)/. By itself, /.*one/ would have some chance at the fixed-string optimizations, but -Dr shows that in this case none are done for the entire expression. In a string where both /one/ and /two/ can match, the overall expression must simply .* to the end of the string and backtrack to /one/, then do the same for /two/. If the /one/ and /two/ are near the end of the string, they'll be found in short order. If near the beginning, the *? version will be the one that finds them quickly. This is easily confirmed by a glance at Figure 2, seeing how the lines cross between #7 and #8.

That #8 and #9 take longer, in general, than #6 and #7 probably reflects the fact that * can be optimized much more than *?, as mentioned earlier.

It's interesting to note that by looking at the graph, the lack of change from #6->#7 and #8->#9 might make one conclude that /^/ makes no difference. Again, the Fred syndrome. With a string that doesn't match (and even worse, one that has many ones but no twos), the non-/^/ versions could take orders of magnitude longer, since they would have to fail all over again when restarting the match from the second character, third character, etc., through the whole length of the string.

Conclusions

Figure 3: Contestant from the studio audience

Figure 3: Contestant from the studio audience

Although I haven't mentioned much about Contestant #1, you might imagine that it performs quite well. It's amenable to the extra-regex optimizations, and won't suffer from any backtracking. With some types of data, it might be slower than some of the other expressions, but generally performs well even in the wild cases. Figure 3 shows the results of all the tests with all the contestants, including a mystery guest #10. The green lines are the no-match cases. The two markers for #7 and #9 indicate that they go ballistic off the scale (up to a relative speed of almost 170), as described in the previous section (the values for #8 and #10 are shown via green boxes since I didn't want to plot the lines across #7/#9).

Figure 4: Contestant from the studio audience

Figure 4: Contestant from the studio audience

As you can see, #1, #6, and mystery guest #10 do generally well (Figure 4 shows a close-up). Figure 3's black line with the #4/#5 spike is for the wild-case test I proposed in the quiz - a test that would have #2/#3 do better than #4/#5.Here it is:

two one one one one one one one one one

where the first expression has to try many times before failing and then matching with the second.

So who is our mystery guest #10? I did these benchmarks using /one/ and /two/ as the sample expressions, but in general they could be any regular expression. However, if they really are fixed strings, then one could just use

index($_, "one") >= 0 and index($_, "two") >= 0

for the test, so I decided to include it for comparison.

THE REAL CONCLUSIONS

I realize that this has been quite a rambling article, but my goal has not to been to show one particular fact or technique or answer, but to expose you to some of the thought processes that I go through when evaluating an expression, and the regex theory and Perl practice that goes behind it. As anyone with a degree from the School of Hard Knocks will tell you, experience is a great teacher, so I aspired to share some of mine with you. One of the most important points I hope you'll take away with you is that when evaluating, make sure to include a wide variety of test data, including various kinds of non-matching data. As Fred will tell you, failing efficiency can sometimes be more important than matching efficiency.

__END__


Jeffrey Friedl works at Omron Corp (Kyoto, Japan), and still spends his copious free time on a regex book for O'Reilly. He can be reached at jfriedl@omron.co.jp.


PREVIOUS  TABLE OF CONTENTS  NEXT