Matching Perl identifiers is a lot harder now

In the Learning Perl Student Workbook (first edition), I had an exercise to match a Perl variable with a regular expression. This is supposed to be a simple exercise with a simple answer, so I excluded any special variables, such as $1 or ${^UNICODE}.

A long time ago in a Perl far, far away, when character classes were much smaller (see Know your character classes under different semantics), the pattern seemed simple:

$candidate =~ qr/\A[\$%\@][a-zA-Z_]\w*\z/;

If you want that to be easier to read, you can use the /x modifier to add insignificant whitespace and comments:

$candidate =~ qr/
	\A          # beginning of string
	[\$%\@]     # sigil
	[a-zA-Z_]   # first character
	\w*         # the rest, which can include digits
	\z          # end of string
	/x;

You can make this slightly simpler with case insensitivity:

$candidate =~ qr/
	\A          # beginning of string
	[\$%\@]     # sigil
	[a-z_]      # first character
	\w*         # the rest, which can include digits
	\z          # end of string
	/ix;
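
A quick way to sanity-check a pattern like this is to run it over a handful of candidates. This is only a sketch: the looks_like_variable name and the sample strings are made up for illustration, and it uses \w* for the rest of the name so that one-letter names such as %h pass too.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string looks like an ASCII-era
# scalar, array, or hash name.
sub looks_like_variable {
	my( $candidate ) = @_;

	return $candidate =~ m/
		\A          # beginning of string
		[\$%\@]     # sigil
		[a-z_]      # first character
		\w*         # the rest, which can include digits
		\z          # end of string
		/ix ? 1 : 0;
	}

# Single quotes keep Perl from interpolating the sample names
foreach my $candidate ( '$cat', '@dogs', '%h', '$_private', '$9lives', 'cat', '$two words' ) {
	printf "%-12s %s\n", $candidate,
		looks_like_variable( $candidate ) ? 'matches' : 'does not match';
	}
```

The first four candidates should match. The others fail because of the leading digit, the missing sigil, and the space.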

Now, in modern Perl, the situation is much different because you can use many more characters in a name. With Perl 5.14, \w matches 102,724 characters (the exact number depends on the Unicode version your perl supports), which you can count yourself:

use 5.014;

my $count = 0;

foreach my $ord ( 0 .. 0x10ffff ) {
	my $char = chr( $ord );
	
	$count++ if $char =~ /\w/;
	}

say "count is $count";

You can sorta fix this with the new /a pattern modifier, which limits the character classes to their ASCII, old-school versions. You need Perl 5.14:

use 5.014;

$candidate =~ qr/
	\A          # beginning of string
	[\$%\@]     # sigil
	[a-z_]      # first character
	\w*         # the rest, which can include digits, now only ASCII
	\z          # end of string
	/aix;
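
To see what /a changes, you can try \w against a few individual characters with and without the modifier. This is a small sketch rather than anything from the workbook. The word_char_report name and the three sample code points are assumptions picked for illustration.

```perl
use 5.014;
use warnings;

# Hypothetical helper: for one code point, report whether its character
# matches \w under the default rules and under the /a modifier.
sub word_char_report {
	my( $ord ) = @_;
	my $char = chr( $ord );

	my $plain  = $char =~ /\w/  ? 1 : 0;
	my $with_a = $char =~ /\w/a ? 1 : 0;

	return ( $plain, $with_a );
	}

# a, GREEK SMALL LETTER ALPHA, ARABIC-INDIC DIGIT THREE
foreach my $ord ( 0x61, 0x03B1, 0x0663 ) {
	my( $plain, $with_a ) = word_char_report( $ord );
	printf "U+%04X  \\w: %d  \\w with /a: %d\n", $ord, $plain, $with_a;
	}
```

The ASCII letter matches both ways, while the Greek letter and the Arabic-Indic digit match \w only when /a is absent.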

That’s not really a fix because those aren’t all the letters you can use in a Perl variable name. If you wanted to handle all of the names, what would have to go in that character class besides a-z? In the space of all Unicode characters that match \w, how many ranges would you have to construct, recognizing that there can be a lot of holes? Well, count them:

use 5.014;

my @ranges;
my $count    = 0;
my $in_range = 0;
my $start;

foreach my $ord ( 0 .. 0x10ffff ) {
	my $char = chr( $ord );
	
	if( $char =~ /\w/ ) {
		$count++;
		$start = $ord unless $in_range;
		$in_range++;
		}
	elsif( $in_range ) {
		my $end = $ord - 1;
		push @ranges, [ $start, $end ];
		$in_range = 0;
		}
	}

say "count is $count";
say "There are " . @ranges . " ranges";

There are 514 ranges you’d need to list in your character class. You can cut that down by excluding the ranges that case-fold onto another character already in a range. But, how much work would you have to do to figure that out?

So, that’s not going to work. You were lucky that there were only two ranges to list in the ASCII version. Don’t transfer that technique to what you have to do now.

You’re doing all of this work because of one rule: your name can’t start with a decimal digit, since those names are reserved for the regular expression capture variables ($1, $2, and so on). That’s where the trouble starts. However, with Unicode properties, the fix turns out to be even simpler, both in concept and implementation. There’s a property for characters that are legal as identifier starting characters, ID_Start (but it doesn’t include the underscore), and a property for characters that can come after that, ID_Continue:

$candidate =~ qr/
	\A                 # beginning of string
	[\$%\@]            # sigil
	[_\p{ID_Start}]    # first character
	\p{ID_Continue}*   # the rest, which can include digits
	\z                 # end of string
	/x;
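
Here is a sketch of how you might try the property-based pattern on names that go beyond ASCII. The is_variable_name helper and the sample names are hypothetical, and the sketch uses * after \p{ID_Continue} so that one-letter names are allowed. The non-ASCII characters are written as \x{...} escapes so the source file can stay plain ASCII.

```perl
use 5.014;
use warnings;

# Hypothetical helper built on the ID_Start and ID_Continue properties
sub is_variable_name {
	my( $candidate ) = @_;

	return $candidate =~ m/
		\A                 # beginning of string
		[\$%\@]            # sigil
		[_\p{ID_Start}]    # first character
		\p{ID_Continue}*   # the rest, which can include digits
		\z                 # end of string
		/x ? 1 : 0;
	}

my @samples = (
	[ 'ASCII name',               '$cat'                 ],
	[ 'leading underscore',       '$_private'            ],
	[ 'Greek alpha, beta',        '$' . "\x{3B1}\x{3B2}" ],
	[ 'ASCII digit first',        '$9lives'              ],
	[ 'Arabic-Indic digit first', '$' . "\x{0663}abc"    ],
	[ 'Arabic-Indic digit later', '$abc' . "\x{0663}"    ],
	);

# Print the label instead of the name to avoid wide character warnings
foreach my $sample ( @samples ) {
	my( $label, $candidate ) = @$sample;
	say "$label: ", is_variable_name( $candidate ) ? 'matches' : 'does not match';
	}
```

Only the two names that start with a digit should fail. A non-ASCII decimal digit is rejected as a first character just like 9 is, but it is fine later in the name.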

You have to remember, though, that these are characters, not grapheme clusters, so you still might need several of them to make one thing that you’d call a character, since the human definition is much looser than the term of art.
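
As a small illustration of that difference, here is a sketch that counts the same string both ways. The sample string is an assumption chosen for the example: an e followed by a combining acute accent, which a reader sees as a single accented letter.

```perl
use 5.014;
use warnings;

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
my $name = "e\x{301}";

my $characters = length $name;            # counts code points
my $clusters   = () = $name =~ /\X/g;     # counts grapheme clusters

say "characters: $characters";
say "grapheme clusters: $clusters";

# The combining mark is itself a legal continuing character
say 'the combining mark matches \p{ID_Continue}'
	if "\x{301}" =~ /\A\p{ID_Continue}\z/;
```

The string has two characters but only one grapheme cluster, so a pattern built from \p{ID_Continue} consumes two characters for what looks like one letter.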

For some fussiness with decompositions of characters, Unicode Standard Annex #31 recommends the improved XID_Start and XID_Continue properties:

$candidate =~ qr/
	\A                  # beginning of string
	[\$%\@]             # sigil
	[_\p{XID_Start}]    # first character
	\p{XID_Continue}*   # the rest, which can include digits
	\z                  # end of string
	/x;

The trick is knowing that these properties are there. The list in perluniprops is very long and not very informative. You might also want to read Tom Christiansen’s Stack Overflow answer to What characters are allowed in Perl identifiers?.

3 thoughts on “Matching Perl identifiers is a lot harder now”

  1. \p{ID_Start} and \p{XID_Start} don’t match the underscore so you’d need to use [\p{ID_Start}_] and [\p{XID_Start}_], respectively, in those examples. To my knowledge this is the only difference between Perl identifiers and the default Unicode identifiers as defined in UAX #31. It would be useful to add Perl-specific identifier properties to the language like \p{Perl_ID_Start} and \p{Perl_ID_Continue}.

  2. How close does this come? (for just the identifier part, not “variable”/“package” bits, obviously)

    /^[^\W0-9]\w*\z/
    
    1. You’re using the character class shortcuts, which means you’re possibly invoking the Unicode Bug. We don’t know which characters those will match.
