Updates to Chapter 7, “In the World of Regular Expressions”

[This post notes differences between the fifth and sixth editions.]

I just committed the new Chapter 7, “In the World of Regular Expressions”. It was quite an education, even for me, because the character class stuff has changed so much since Perl 5.6, and, since Learning Perl had been ignoring Unicode, we didn’t face the hard problems.

  • The \w character class is almost dangerous now. By default, it matches any of more than 100,000 characters at that position. The \d and \s character classes have the same problem on a smaller scale. It’s unlikely that anyone actually wants these shortcuts anymore, but they are still there in older programs. I did cover this over at The Effective Perler, too.
  • Since we’re covering Unicode, this is the right chapter to cover the Unicode properties, such as \p{Space}. Those don’t completely solve the character class shortcut problem because they still match many characters. The perluniprops documentation lists how many characters match each property, which is kinda cool.
  • Perl 5.13.9 includes Karl Williamson’s work to add the /a adverb to enforce ASCII semantics, so we use that too, even though we don’t really get into options until the next chapter.
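
Here is a minimal sketch of the difference, not an excerpt from the chapter. The three sample characters and the matches() helper are my own choices for illustration, and the script assumes a stable Perl 5.14 or later, where the /a flag is available.

```perl
use v5.14;       # the /a flag needs Perl 5.14 or later
use warnings;

# Sample characters chosen for illustration; none of them is ASCII.
my $digit  = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $letter = "\x{03B1}";   # GREEK SMALL LETTER ALPHA
my $space  = "\x{2003}";   # EM SPACE

# Hypothetical helper: returns 1 if the whole string matches the pattern, 0 otherwise.
sub matches {
    my ($string, $re) = @_;
    return $string =~ /\A$re\z/ ? 1 : 0;
}

for my $case (
    [ '\d',         $digit,  qr/\d/         ],
    [ '\d with /a', $digit,  qr/\d/a        ],
    [ '\w',         $letter, qr/\w/         ],
    [ '\w with /a', $letter, qr/\w/a        ],
    [ '\s',         $space,  qr/\s/         ],
    [ '\s with /a', $space,  qr/\s/a        ],
    [ '\p{Space}',  $space,  qr/\p{Space}/  ],
) {
    my ($label, $char, $re) = @$case;
    # Print the code point rather than the character to keep the output ASCII.
    printf "%-12s U+%04X  %s\n", $label, ord($char),
        matches($char, $re) ? 'match' : 'no match';
}
```

Each shortcut matches its non-ASCII sample by default and stops matching once /a is added, while \p{Space} matches the em space as a Unicode property.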

This is all rather painful to update because I didn’t want to go through everything assuming ASCII semantics (so, very few changes) and then tack on an “if you are using Unicode” section that invalidates everything before it. We just have to bite the bullet and make the switch to thinking of Unicode as the default and ASCII as the backward-compatibility special case.