PREVIOUS  TABLE OF CONTENTS  NEXT 

Understanding Regular Expressions

Jeffrey Friedl

"To Be Continued..."

In my last article, I used our knowledge of how Perl's regex engine goes about a match to analyze and evaluate a few different solutions to a problem. I'd like to continue to look at the effects of greediness, backtracking, and other important aspects of Perl's regex engine, this time to see some gotchas that await the unwary. A basic understanding of backtracking is a prerequisite; I recommend seeing my first article, "The Story of Fred" (TPJ Vol. 1, Issue 2). Let's start off with a simple but illustrative example taken from daily life: continuation lines. Let's say you've got the text of a csh-style configuration file in $_, and want to pluck alias definitions. You might use

while (m{^ \s* alias \s+ (\S+) \s+ (.*) }xmg) 
{     ($alias, $cmd) = ($1, $2); 
    ...work with $alias and $cmd as you like...
}

If your string is from a tcsh shell's startup script and has a line such as

 alias ls 'echo Ha, got you, sucker! ; rm *'

(put there as some idiot's idea of a practical joke), the snippet above works fine. However, if the jerk tried to be smart and cover his tracks a bit, the "line" might look like:

alias ls 'echo Ha, got you, sucker! ; rm * ;\           
           alias ls echo Ha, got you, sucker!'

This would result in the rm action of the alias happening just once, so by the time you know to look for an alias, you won't see the rm. Anyway, this kind of "line" would break our approach, since it's two physical lines.

Well, we know regular expressions, so no problem! A continuation line is a backslash followed by a newline followed by the next line. We can express that directly with: /\\ \n .*/x.

In adding this to our match, we'll want to wrap it with (?:...)? because it's optional. Actually, we can use (?:...)* to allow any number of lines.

Adding this to our match, we get

m{^ \s* alias \s+ (\S+) \s+ (.* (?:\\ \n.*)*) }xm

It won't work.

If this surprises you, go back to the basics of how Perl's NFA regex engine goes about a match attempt: By the time the first /.*/ stops and control moves to /\\ \n .*/x, the backslash that was ostensibly supposed to be matched by\\ has already been matched by /.*/, and so the whole /\\ \n .*/x fails. That's okay, since it's wrapped by (...)*, but the result is that the match, only to the end of the line, is not what we intend.

It's interesting to note that this is yet another example of why the common "Perl regexes are greedy" myth is wrong. Were they truly greedy, the longest possible match, one comprising both lines in this case, would be found. The saying should be "Perl quantifiers are greedy." In this case, "too greedy" would be appropriate.

Ah, but not all Perl's quantifiers are greedy. Our regex uses seven quantifiers: three pluses and four stars. What if we made some of them non-greedy? Which ones? Well, it doesn't matter for the first three, since with any given text there is no choice about what they match, so the order of checking doesn't matter except for efficiency. (This was a main topic of my article in the previous edition of TPJ - on a local level, /\s*/ is much more efficient than /\s*?/.) [This /^ \s* alias \s+ (\S+)/x is a situation where it would be nice to have quantifiers that are "possessive," never giving up what they match. In this case, such quantifiers would provide for a fair measure of efficiency - due to how they're used in the overall regex, we can see that any backtracking they might do is irrelevant, so telling the regex engine that they shouldn't do any would save time.]

Our problem cropped up because /.*/ was too greedy, so what about "degreedying" that one? If we use /.*?/, it checks what follows, /(?:\\\n.*)*/, before each attempt at its own dot. This means that the backslash will be matched by /\\ \n .*/x if possible, pulling it out from under the initial /.*/.

Along the same lines, we'd want the second /.*/ to be /.*?/ as well. What about the quantifier governing (:...)? Making that non-greedy tells the regex engine that it's okay to not even bother checking for continuation lines before ending the match, so we definitely want to keep it greedy.

This leaves us with:

m{^ \s* alias \s+ (\S+) \s+ ( .*? (?:\\ \n .*? )* ) }xm

Does all this make sense?

I sure hope not, because this regex won't work either!

Think about what happens when we reach /.*? (?:\\\n.*?)*/x. The initial /.*?/ knows that it can match any (non-newline) character, but because it is non-greedy, it will first defer to what follows, only returning to actually try /./ if and only if what follows isn't successful. What follows is governed by star, so is always successful. It might not match anything, but being successful even when not matching is what star is all about! If the alias begins with a continuation backslash, it and its newline will match (and match as many empty continuation lines as are in a row), but otherwise, /.*? (?: \\ \n .*? )*/x at the end of a regex like this is guaranteed not to match any text.

It might have been surprising because my description was meant to lead you into falling for it working, but there's no magic here - it all makes perfect sense in the scheme of how backtracking works. If a cup of coffee and a second reading doesn't clear it up, go directly to my previous article in Volume 1, Issue 2. Do not pass Go and do not collect $200.

So, how could we match continuation lines in this case? Well, we want a bunch of "escaped newlines or other stuff," so how about using

m/(?: \\ \n | . )* /x

instead of /.*/ in the original regex? We know that Perl tries each alternative in turn, so if a backslash is followed by a newline, it will be matched before /./ can get to the backslash, thereby vaulting us past the newline:

 m{^ \s* alias \s+ (\S+) \s+ (?: \\ \n | . )* }xm

There is one problem with this, however. A line ending with \\ before the newline is not a continuation line, but will be treated as one by our regex. The first backslash is not part of an escape-newline sequence, so is not matched by /\\ \n/x. This is good. However, the next backslash can match it, which is bad. /\\ \n/x vaults us past escaped newlines, but we really want it to vault us past any escaped sequences.

We'd like to be able to replace the /\n/ by /./, but dot doesn't normally match a newline. That's a key feature of dot's other use in the regex, so using the /s modifier to have dot match newline would cause a problem. It can be taken care of easily, and the result creates a cleaner regex, I think:

m{^ \s* alias \s+ (\S+) \s+ (?: \\. | [^\n] )* }xms

Notice that I used both the /m and the /s modifiers? That creates what I call a clean multi-line matching mode: the anchors will match at embedded newlines (the influence of /m), and dot will match any byte (the influence of /s).

This regex will now work as desired, but the alternation at each character is not very efficient. One thing that might come to mind to speed it up is to quantify /[^\n]/ with a plus. There are two concerns. First, considering only [^\n], the change makes it an effective /( [^\n]+ )*/x, which is the quintessential neverending match that caused Fred so many problems in my first article. Ah, but nested quantifiers alone do not a neverending match make. A neverending match "never ends" because it is checking a bazillion permutations in search of a match. That's not a concern here because (...)* can never fail. Were there something after it that could fail, then our regex /(?: \\. | [^\n]+ )*/x would indeed be very dangerous. But it's okay in this case.

Secondly, we'd relied on the position of /\\./ in the alternation to have it (rather than /[^\n]/) match a backslash when possible. But now, there's nothing stopping /[^\n]+/ from matching a backslash...until we change it to /[^\n\\]+/.

So, making these changes and benchmarking, I find that

/(?: \\. | [^\n\\]+ )*/x

is about 30 times faster than

/(?: \\. | [^\n] )*/x.

Wow!

As a teaser to what I'm sure will be a future article, let me tell you that this part of the regex can also be written as:

/[^\n\\]* ( \\. [^\n\\]* )*/x

This is derived using a technique I call unrolling the loop, and has a number of benefits. It's about 15% faster than the fast regex in the previous paragraph, and it doesn't suffer the neverending-match fate if inserted before something that could fail. It's a powerful technique that I use often.

Unabashed Commercialism

If you flip the channel at the first glimmer of a commercial, then turn the page now, because I'd like to announce that I've finally finished my book, Mastering Regular Expressions, published by O'Reilly & Associates. It should be hitting the bookstores about the same time this issue of TPJ hits your mailbox.

The book is not about Perl per se, because it covers the concepts of regular expressions that apply to many programs. But regular expressions are not used in a vacuum: the chapter on Perl is over one hundred pages. MRE has a couple of pretty hip owls on the cover; stop by a bookstore and check it out. I think you'll be pleased.

__END__


Jeffrey Friedl works at Omron Corp (Kyoto, Japan), and now spends his copious free time relaxing after a long two and a half years working on his book. He can be reached at jfriedl@omron.co.jp.


PREVIOUS  TABLE OF CONTENTS  NEXT