Matching Perl identifiers is a lot harder now

In the Learning Perl Student Workbook (first edition), I had an exercise to match a Perl variable with a regular expression. This is supposed to be a simple exercise with a simple answer, so I excluded any special variables, such as $1 or ${^UNICODE}.

A long time ago in a Perl far, far away, when character classes were much smaller (see Know your character classes under different semantics), the pattern seemed simple:

$candidate =~ qr/\A[\$%\@][a-zA-Z_]\w*\z/;

To make that easier to read, you can use the /x modifier to add insignificant whitespace and comments:

$candidate =~ qr/
  \A          # beginning of string
  [\$%\@]     # sigil
  [a-zA-Z_]   # first character
  \w*         # others can have digits
  \z          # end of string
  /x;

You can make this slightly simpler with case insensitivity:

$candidate =~ qr/
  \A          # beginning of string
  [\$%\@]     # sigil
  [a-z_]      # first character
  \w*         # others can have digits
  \z          # end of string
  /ix;
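To check a pattern like this, you can run it against a few candidates. This is only a sketch: the is_ascii_variable name and the sample strings are made up for illustration, and it uses \w* so that one-letter names match too.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the workbook): returns 1 if the string
# looks like an ASCII-only scalar, array, or hash name, 0 otherwise.
sub is_ascii_variable {
  my( $candidate ) = @_;

  return $candidate =~ m/
    \A          # beginning of string
    [\$%\@]     # sigil
    [a-z_]      # first character
    \w*         # zero or more, so one-letter names match
    \z          # end of string
    /ix ? 1 : 0;
  }

foreach my $candidate ( '$cat', '@_private', '%h', '$9lives', 'cat', '$' ) {
  printf "%-10s %s\n", $candidate,
    is_ascii_variable( $candidate ) ? 'matches' : 'does not match';
  }
```

The first three candidates should match. The others fail because of a leading digit, a missing sigil, or a missing name.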

In modern Perl the situation is much different because you can use many more characters in a name. That \w now matches 102,724 characters (the exact count depends on the Unicode version your perl supports), which you can count yourself:

use 5.014;

my $count = 0;

foreach my $ord ( 0 .. 0x10ffff ) {
  my $char = chr( $ord );
  
  $count++ if $char =~ /\w/;
  }

say "count is $count";

You can sorta fix this with the new /a pattern modifier, which limits the character classes to their ASCII, old-school versions. You need Perl 5.14 or later:

use 5.014;

$candidate =~ qr/
  \A          # beginning of string
  [\$%\@]     # sigil
  [a-z_]      # first character
  \w*         # others can have digits, now only ASCII
  \z          # end of string
  /aix;
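A quick way to see what /a changes is to test one non-ASCII letter both ways. The helper names here are invented for this sketch, and the Greek alpha is an arbitrary choice:

```perl
use 5.014;

# Hypothetical helpers for illustration: the same \w test,
# once with Unicode rules and once restricted to ASCII by /a.
sub is_word_char       { $_[0] =~ /\A\w\z/  ? 1 : 0 }
sub is_ascii_word_char { $_[0] =~ /\A\w\z/a ? 1 : 0 }

my $alpha = "\x{3b1}";   # GREEK SMALL LETTER ALPHA

say "alpha without /a: ", is_word_char( $alpha );        # 1
say "alpha with /a:    ", is_ascii_word_char( $alpha );  # 0
say "x with /a:        ", is_ascii_word_char( 'x' );     # 1
```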

That’s not really a fix because those aren’t all the letters you can use in a Perl variable name. If you wanted to handle all of the names, what would have to go in that character class besides a-z? In the space of all Unicode characters that match \w, how many ranges would you have to construct, recognizing that there can be a lot of holes? Well, count them:

use 5.014;

my @ranges;
my $count    = 0;
my $in_range = 0;
my $start;

foreach my $ord ( 0 .. 0x10ffff ) {
  my $char = chr( $ord );
  
  if( $char =~ /\w/ ) {
    $count++;
    $start = $ord unless $in_range;
    $in_range++;
    }
  elsif( $in_range ) {
    my $end = $ord - 1;
    push @ranges, [ $start, $end ];
    $in_range = 0;
    }
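  }

# close a range that runs all the way to the last code point
push @ranges, [ $start, 0x10ffff ] if $in_range;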

say "count is $count";
say "There are " . @ranges . " ranges";

There are 514 ranges you’d need to list in your character class. You can cut that down by excluding the ranges that case-fold onto another character already in a range. But, how much work would you have to do to figure that out?

So, that’s not going to work. You were lucky that there were only two ranges to list in the ASCII version. Don’t transfer that technique to what you have to do now.
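If you want to look at the ranges instead of only counting them, the same loop can return them. In this sketch the word_ranges name and its upper-limit argument are inventions for illustration. Limited to ASCII, it finds the digits, the two letter ranges, and the underscore:

```perl
use 5.014;

# Hypothetical helper: collect [ start, end ] pairs of consecutive
# code points from 0 to $max that match \w.
sub word_ranges {
  my( $max ) = @_;
  my( @ranges, $start );

  foreach my $ord ( 0 .. $max ) {
    if( chr( $ord ) =~ /\w/ ) {
      $start //= $ord;             # open a range if none is open
      }
    elsif( defined $start ) {
      push @ranges, [ $start, $ord - 1 ];
      undef $start;
      }
    }

  # close a range that runs all the way to $max
  push @ranges, [ $start, $max ] if defined $start;

  return @ranges;
  }

say join ' ', map { sprintf( 'U+%04X-U+%04X', @$_ ) } word_ranges( 0x7f );
```

Pass 0x10ffff instead to walk the whole Unicode range.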

You’re doing all of this work because a name can’t start with a decimal digit; those names are reserved for the regular expression capture variables, such as $1. That one restriction is where the trouble starts. With Unicode properties, however, the fix turns out to be even simpler, both in concept and implementation. There’s a property for characters that are legal as identifier starting characters, ID_Start, and a property for characters that can come after that, ID_Continue:

$candidate =~ qr/
  \A                  # beginning of string
  [\$%\@]             # sigil
  [\p{ID_Start}_]     # first character; ID_Start has no underscore
  \p{ID_Continue}*    # the rest, which can include digits
  \z                  # end of string
  /x;

You have to remember, though, that these properties match characters (code points), not grapheme clusters. Something that you’d call a single character might still be several code points, since the human definition is much looser than the term of art.
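Here's a small sketch of both points: the property-based match, and the gap between code points and grapheme clusters. The helper names are made up, and the underscore in the first-character class is an addition of this sketch, because ID_Start doesn't include it:

```perl
use 5.014;

# Hypothetical helper: the property-based pattern, with an underscore
# added to the first-character class as an assumption, because
# ID_Start does not include it.
sub is_identifier {
  my( $candidate ) = @_;

  return $candidate =~ m/
    \A
    [\$%\@]             # sigil
    [\p{ID_Start}_]     # first character
    \p{ID_Continue}*    # the rest
    \z
    /x ? 1 : 0;
  }

# Hypothetical helper: count grapheme clusters with \X
sub grapheme_count {
  my( $string ) = @_;
  my $count = () = $string =~ /\X/g;
  return $count;
  }

my $name = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

say is_identifier( "\$$name" );      # 1: the combining mark is ID_Continue
say is_identifier( '$9lives' );      # 0: a digit cannot start the name
say length $name;                    # 2 code points
say grapheme_count( $name );         # 1 grapheme cluster
```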

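To handle some fussiness with character decompositions, Unicode Standard Annex #31 recommends the XID_Start and XID_Continue properties, which are adjusted so that an identifier is still an identifier after normalization: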

$candidate =~ qr/
  \A                   # beginning of string
  [\$%\@]              # sigil
  [\p{XID_Start}_]     # first character; XID_Start has no underscore
  \p{XID_Continue}*    # the rest, which can include digits
  \z                   # end of string
  /x;
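To see one place where the two sets differ, try U+0E33 THAI CHARACTER SARA AM. It's a letter, but its compatibility decomposition begins with a combining mark, so it can continue a normalized identifier but not start one. The identifier_properties helper is a made-up name for this sketch:

```perl
use 5.014;

# Hypothetical helper: which of the four properties does a single
# character have? Returns a hash reference of 1 or 0 values.
sub identifier_properties {
  my( $char ) = @_;

  return {
    ID_Start     => $char =~ /\A\p{ID_Start}\z/     ? 1 : 0,
    XID_Start    => $char =~ /\A\p{XID_Start}\z/    ? 1 : 0,
    ID_Continue  => $char =~ /\A\p{ID_Continue}\z/  ? 1 : 0,
    XID_Continue => $char =~ /\A\p{XID_Continue}\z/ ? 1 : 0,
    };
  }

my $props = identifier_properties( "\x{0e33}" );   # THAI CHARACTER SARA AM
say "$_: $props->{$_}" for sort keys %$props;
```

You should see ID_Start set and XID_Start clear, with both of the Continue properties set.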

The trick is knowing that these properties are there. The list in perluniprops is very long and not very informative. You might also want to read Tom Christiansen’s Stack Overflow answer to What characters are allowed in Perl identifiers?.
