The evolution of character class shortcuts

Character class shortcuts used to be easy because ASCII was easy. They worked well if ASCII was what you wanted, but were quite limiting otherwise. Perl v5.6 introduced Unicode support and the world started to change.

A good programmer is always trying to eliminate ambiguity. Their code should work the same way everywhere, but character class shortcuts can’t guarantee that anymore. I wrote about these a bit in “Know your character classes” for The Effective Perler.

Explaining these in the first Learning Perl was easy because those were simpler times. A string was a sequence of octets, each of which represented a particular character in the range from 0 to 127 (above that, not so much). This made the character classes easy:

\d    [0-9]           digits
\s    [ \f\t\n\r]     whitespace (but not vertical tab)
\w    [a-zA-Z0-9_]    “word” characters

Perl v5.6 added a new sort of string. Now we had the raw, octet strings (something the bytes module lets you peek at), and UTF-8 strings. We’re not supposed to know any of this, but there it is. If you’ve used some of the JSON modules, you might have been forced to remember this because they expect UTF-8 input like you’d take directly out of an HTTP message body or a raw read from the disk. That’s already stored as UTF-8 before Perl does anything with it.
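The difference between the two kinds of string is easier to see than to describe. This is a minimal sketch, assuming the core Encode and JSON::PP modules are available; the variable names and the sample data are made up for illustration:

```perl
use strict;
use warnings;
use Encode   qw(decode);
use JSON::PP qw(decode_json);

# Five octets: the UTF-8 encoding of the four characters "café".
# This is what a raw read from disk or an HTTP body would hand you.
my $octets = "caf\xC3\xA9";

# Decoding turns the octets into a character string.
my $chars = decode( 'UTF-8', $octets );

my $octet_length = length $octets;    # counts octets
my $char_length  = length $chars;     # counts characters

# utf8::is_utf8 peeks at the internal representation (illustration only).
my $octets_flag = utf8::is_utf8($octets) ? 1 : 0;
my $chars_flag  = utf8::is_utf8($chars)  ? 1 : 0;

# decode_json wants the undecoded UTF-8 octets and does the decoding itself.
my $data        = decode_json( qq({"name":"caf\xC3\xA9"}) );
my $name_length = length $data->{name};

print "octets: $octet_length (flag $octets_flag)\n";
print "chars:  $char_length (flag $chars_flag)\n";
print "name:   $name_length characters\n";
```

The same text has two lengths depending on whether Perl sees it as octets or as characters, and that is the distinction the character class shortcuts end up depending on.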

Perl v5.10 shifted the meaning of \w and \d. The perlrecharclass docs note three situations for the new \w:

  • If the internal representation is UTF-8, \w matches “those characters that are considered word characters in the Unicode database”
  • If there’s a locale in effect, \w matches “those characters that are considered word characters by the current locale.”
  • If neither of those applies, \w matches [A-Za-z0-9_].
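The first and last situations can be seen with a single character. This sketch assumes a perl with no locale in effect and no `use utf8`, `use feature 'unicode_strings'`, or `use v5.12` (or later) in scope, since any of those changes the default rules; the variable names are mine:

```perl
use strict;
use warnings;

# e-acute as a single octet: a plain, non-UTF-8 string.
my $native = "\xE9";

# The same character, but stored internally as UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

# Perl considers the two strings equal...
my $same = $native eq $upgraded ? 1 : 0;

# ...but under the default rules \w looks at the internal representation.
my $native_matches   = $native   =~ /\w/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /\w/ ? 1 : 0;

print "equal: $same\n";
print "native matches \\w:   $native_matches\n";
print "upgraded matches \\w: $upgraded_matches\n";
```

Two strings that compare as equal give different answers to the same pattern, which is exactly the ambiguity at issue.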

Take a moment to realize how cool perldoc.perl.org is. You can look up the docs back to v5.8.8. Now we just need a visual representation of the differences over all versions.

The middle situation with the locale doesn’t bother me that much. The match operator will not untaint data when the character class shortcuts are subject to locale definitions since those definitions are external data (so tainted themselves). Security isn’t the only issue though. Correct and expected behavior counts for something.

The meaning of \d shifted too. Instead of 10 ASCII digits, it matches 550 characters in v5.24, all of them decimal digits, albeit from different scripts. The \s shortcut matched five characters up to v5.10, then expanded to include the Unicode spaces in the UTF-8 case. Starting with v5.18, \s also matches the vertical tab in the default ASCII mode (if anyone uses the vertical tab for its purpose, please tell me about it in the comments). In v5.24, it matches 22 characters in Unicode mode. That’s the situation now.
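Here is a small sketch of those shifts, assuming v5.14 or later for the /a flag and v5.18 or later for the vertical tab behavior. The variable names are only for illustration:

```perl
use strict;
use warnings;

my $arabic_nine = "\x{0669}";    # ARABIC-INDIC DIGIT NINE
my $vtab        = "\x0B";        # vertical tab

# A code point above 255 forces Unicode rules, so \d matches it.
my $d_default = $arabic_nine =~ /\A\d\z/ ? 1 : 0;

# The /a flag restricts the shortcuts to their ASCII meanings.
my $d_ascii = $arabic_nine =~ /\A\d\z/a ? 1 : 0;

# An explicit class never meant anything but these ten characters.
my $d_class = $arabic_nine =~ /\A[0-9]\z/ ? 1 : 0;

# On v5.18 and later, \s matches the vertical tab even for plain strings.
my $s_vtab = $vtab =~ /\A\s\z/ ? 1 : 0;

print "\\d default:  $d_default\n";
print "\\d with /a:  $d_ascii\n";
print "[0-9]:       $d_class\n";
print "\\s vs. VT:   $s_vtab\n";
```

If you only ever wanted the ten ASCII digits, either the /a flag or the explicit [0-9] says so unambiguously.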

How do you know which situation your pattern will encounter? You can have UTF-8 and octet strings in the same program. Which one will end up in the variable you bind to the match or substitution?
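One way to stop guessing is to take the decision away from the string. This sketch assumes v5.14 or later, where the /u and /a flags exist; utf8::is_utf8 is used only to show which kind of string each one is, and the hash keys are my own names:

```perl
use strict;
use warnings;

my $native   = "\xE9";    # e-acute as an octet string
my $upgraded = "\xE9";
utf8::upgrade($upgraded); # same character, UTF-8 internally

my @results;
for my $string ( $native, $upgraded ) {
    push @results, {
        utf8_flag => utf8::is_utf8($string) ? 1 : 0,
        unicode   => $string =~ /\w/u ? 1 : 0,    # always Unicode rules
        ascii     => $string =~ /\w/a ? 1 : 0,    # always ASCII rules
    };
}

for my $r (@results) {
    print "flag=$r->{utf8_flag} /u=$r->{unicode} /a=$r->{ascii}\n";
}
```

With an explicit flag, both strings give the same answer no matter how they are stored, so it no longer matters which one ends up in the variable.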

Knowing that something can go wrong isn’t as important as figuring out the consequences. We’re all probably sloppy with some patterns until something makes us pay more attention. If we match an Arabic-Indic numeral (like ٩), how devastating will that be? Or, does anyone care that you might miss out on matching a vertical tab? I didn’t find any public security issues where the character class shortcut was the problem.