
Updates to Chapter 9, “Processing Text with Regular Expressions”

[This post notes differences between the fifth and sixth editions.]

I didn’t have to make many changes to this chapter. I wanted to put in at least one Perl 5.14 feature, but the only new thing that the substitution operator gets is the /r modifier.
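For anyone who hasn't met /r yet, here's a minimal sketch of what it changes. The variable names and strings are my own illustration, not an example from the book. Without /r, the substitution edits its target in place; with /r, it leaves the target alone and returns the modified copy.

```perl
use v5.14;    # /r needs Perl 5.14 or later; this also turns on strict and say

my $original = 'Hello, World!';

# The old idiom: copy first, then substitute in place on the copy
( my $copy = $original ) =~ s/World/Llama/;

# With /r, the substitution returns the new string and leaves
# $original untouched
my $changed = $original =~ s/World/Llama/r;

say $original;    # Hello, World!
say $copy;        # Hello, Llama!
say $changed;     # Hello, Llama!
```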

While working through this chapter, though, I started to wonder whether the terms we used in the previous editions matched the ones in the Perl documentation. We called the modifiers “option modifiers”, and sometimes “flags”, but perlre just says “modifier”. Personally, I’m used to saying “flag” all the time and I like that term just fine, but for regular people, there’s nothing to connect the everyday use of “flag” to the thing after the match operators. So, “modifier” it is. I’d much rather use “adverb”, which is popular in Perl 6 land, but it’s a bit late for Perl 5 to change terms. When I made the switch in this chapter, I had to go back to Chapters 7 and 8 and do the same thing.

This chapter is also curious in that it ends with a long example that builds up to a perl one-liner. One of the things the reviewers suggested for a new edition was a new chapter devoted to one-liners. That’s still possible, I guess.

Updates to Chapter 8, “Matching with Regular Expressions”

[This post notes differences between the fifth and sixth editions.]

There are a couple of interesting updates for Chapter 8. The small change is the slight modification of a footnote. We mentioned that the performance problem of the match variables $& and friends wouldn’t be solved before Perl 6. However, with Perl 5.10’s introduction of the /p match operator flag, problem solved!
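As a rough illustration (the string and variable names here are mine, not the book's), /p gives you per-match versions of the three match variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}. They stand in for $`, $&, and $' without turning on the program-wide penalty.

```perl
use v5.10;
use strict;
use warnings;

my $string = 'Hello there, neighbor';
my ( $pre, $match, $post );

# /p asks Perl to preserve the matched string for this match only
if ( $string =~ /\bthere\b/p ) {
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
    say "prematch:  [$pre]";     # [Hello ]
    say "match:     [$match]";   # [there]
    say "postmatch: [$post]";    # [, neighbor]
}
```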

Chapter 8 also has a subtle shift in thinking about anchors. Perl 5 introduced the \A, \Z, and \z regular expression anchors. Somehow, we never made the shift from the Perl 4 anchors ^ and $. Even after Perl Best Practices pointed out the problem, we failed to update the Llama.
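Here's a small sketch of why the difference matters, using made-up strings. The $ anchor is happy to match just before a trailing newline, and ^ changes meaning under /m, while \A and \z always mean the absolute start and end of the string.

```perl
use v5.10;
use strict;
use warnings;

my $with_newline = "fred\n";
my $two_lines    = "barney\nfred";

# $ matches at the end of the string or just before a trailing newline
my $dollar_matches = $with_newline =~ /^fred$/ ? 1 : 0;

# \z matches only at the absolute end, so the newline gets in the way
my $z_matches = $with_newline =~ /\Afred\z/ ? 1 : 0;

# Under /m, ^ also matches right after any embedded newline...
my $caret_m_matches = $two_lines =~ /^fred/m ? 1 : 0;

# ...but \A still means the very start of the string
my $A_m_matches = $two_lines =~ /\Afred/m ? 1 : 0;

say "dollar: $dollar_matches, \\z: $z_matches";          # dollar: 1, \z: 0
say "^ with /m: $caret_m_matches, \\A with /m: $A_m_matches";  # ^ with /m: 1, \A with /m: 0
```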

I’d never really bothered to check when Perl introduced \A until today. That’s a task I do quite frequently: when did some feature show up in Perl? I could just go through all the tarballs, unpack them, and look at the documentation, but there’s an easier way. Since I have a clone of the perl repository, I have access to the entire perl development history. Each release has a tag, and I can list all the tags:

$ git tag
perl-1.0
perl-1.0.15
perl-1.0.16
perl-2.0
perl-2.001
perl-3.000
perl-3.044
perl-4.0.00
perl-4.0.36
perl-5.000
perl-5.000o
perl-5.001
perl-5.001n
perl-5.002
perl-5.002_01
perl-5.003
...

If I want to see what was going on in a particular release, I check out the appropriate tag:

$ git checkout perl-5.000

Now I can see the state of the repo at the point of that release. Sure enough, \A, \Z, and \z are in the documentation back then.

Captures, memories, and clusters

Language evolves, Perl evolves, and Perl being the product of a linguist, the language of Perl evolves.

When I started using Perl, we didn’t have a formal name for the result of parentheses in a regular expression.

Updates to Chapter 7, “In the World of Regular Expressions”

[This post notes differences between the fifth and sixth editions.]

I just committed the new Chapter 7, “In the World of Regular Expressions”. It was quite an education, even for me, because the character class stuff has changed so much since Perl 5.6, and, since Learning Perl had been ignoring Unicode, we didn’t face the hard problems.

The \w character class is almost dangerous now. By default, it represents over 100,000 characters that can match at that position. The \d and \s character classes have the same problem on a smaller scale. It’s unlikely that anyone actually wants these shortcuts anymore, but they are still in older programs. I did cover this over at The Effective Perler, too.

Since we’re covering Unicode, this is the right chapter to cover the Unicode properties, such as \p{Space}. Those don’t completely solve the character class shortcut problem because they still match many characters. The perluniprops documentation lists how many characters match each property, which is kinda cool.

Perl 5.13.9 includes Karl Williamson’s work to add the /a adverb to enforce ASCII semantics, so we use that too, even though we don’t really get into options until the next chapter.
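To make that concrete, here's a minimal sketch with a few code points I picked for illustration (they aren't from the chapter). Each one matches a shortcut under the default Unicode semantics and stops matching once /a restricts the shortcut to ASCII. The \p{Space} property is not one of the shortcuts that /a restricts, so it keeps matching.

```perl
use v5.14;    # /a needs Perl 5.14 or later

my $digit    = "\x{0663}";    # ARABIC-INDIC DIGIT THREE
my $alpha    = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
my $em_space = "\x{2003}";    # EM SPACE

# Default semantics: the shortcuts match far more than ASCII
my $unicode_d = $digit    =~ /\A\d\z/ ? 1 : 0;    # 1
my $unicode_w = $alpha    =~ /\A\w\z/ ? 1 : 0;    # 1
my $unicode_s = $em_space =~ /\A\s\z/ ? 1 : 0;    # 1

# With /a, \d, \w, and \s match only ASCII characters
my $ascii_d = $digit    =~ /\A\d\z/a ? 1 : 0;     # 0
my $ascii_w = $alpha    =~ /\A\w\z/a ? 1 : 0;     # 0
my $ascii_s = $em_space =~ /\A\s\z/a ? 1 : 0;     # 0

# Unicode properties are unaffected by /a
my $prop_space = $em_space =~ /\A\p{Space}\z/a ? 1 : 0;    # 1

say "Unicode: d=$unicode_d w=$unicode_w s=$unicode_s";
say "ASCII:   d=$ascii_d w=$ascii_w s=$ascii_s";
say "\\p{Space} under /a: $prop_space";
```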

This is all rather painful to update because I didn’t want to go through everything assuming ASCII semantics (so, very few changes) then tack on an “if you are using Unicode” section that then invalidates everything. We just have to bite the bullet and make the switch to thinking of Unicode as the default and ASCII as the backward-compatibility special case.

“captures” versus “memories”, “group” versus “buffer”

The term “memories” to label the side effects of parentheses has fallen out of favor. The new hotness is “capture group”, although that has sometimes shown up as “capture buffer” in the documentation. Karl Williamson, however, purged the docs of “capture buffer”, so you shouldn’t see that anywhere in Perl 5.14’s docs. This mostly affects Chapter 8, where we introduce the match variables, even though we have grouping and backreferences in Chapter 7.

I’m not so sure I like “groups” everywhere though. I think that’s the right term to apply to the particular parentheses that triggered the capture, but not necessarily the thing actually captured. It’s the difference between asking which team is in the Super Bowl and who’s on the Super Bowl team.
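A quick sketch of the distinction I mean, with a made-up name as the input: the groups are the parentheses in the pattern, and the captures are the strings those groups happened to grab on a particular match. The (?:...) form is a group that captures nothing.

```perl
use v5.10;
use strict;
use warnings;

my $name = 'Fred Flintstone';

# Two capture groups, with a non-capturing group between them.
# In list context, the match returns what each capture group grabbed.
my ( $first, $last ) = $name =~ /\A(\w+)(?:\s+)(\w+)\z/;

say "Group 1 captured: $first";    # Group 1 captured: Fred
say "Group 2 captured: $last";     # Group 2 captured: Flintstone
```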

I don’t really care that much, though, because there’s one overriding concern: we need to use the same terms that are in the documentation so people have the right search terms.

Perl 6 has a thing called captures, but that’s a completely different beast.