There’s a better (correct) way to case fold

We show you the wrong way to do a case-insensitive sort in Learning Perl, 6th Edition. That edition showed many of Perl’s Unicode features, which we had mostly ignored in all of the previous editions (despite Unicode support starting in Perl v5.6). In our defense, it wasn’t an easy thing to do without CPAN modules before the upcoming Perl v5.16.

In the “Strings and Sorting” chapter, we show this subroutine:

sub case_insensitive { "\L$a" cmp "\L$b" }

In the Unicode world, that doesn’t always work because lowercasing is not the same thing as case folding (which I explain in Fold cases properly at The Effective Perler). With Perl v5.16, we should use the new fc built-in, which does case folding according to Unicode’s rules:

use v5.16; # when it's released
sub case_insensitive { fc($a) cmp fc($b) }

We could use the double-quote case shifter \F to do the same thing:

use v5.16; # when it's released
sub case_insensitive { "\F$a" cmp "\F$b" }

Without Perl v5.16, we could use the Unicode::CaseFold module which defines an fc function.
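
To see why the two approaches differ, here is a small sketch that compares two pairs of strings both ways. The sample strings and the helper names same_lc and same_fc are my own choices for illustration, not something from the book, and the sketch assumes Perl v5.16 or later so that fc is available:

```perl
use v5.16;    # enables fc and Unicode string semantics
use warnings;

# Hypothetical helpers for illustration: compare two strings
# after lowercasing and after case folding.
sub same_lc { lc( $_[0] ) eq lc( $_[1] ) }
sub same_fc { fc( $_[0] ) eq fc( $_[1] ) }

my @pairs = (
    # "ss" spelled out versus LATIN SMALL LETTER SHARP S
    [ "STRASSE", "stra\x{DF}e" ],
    # two capital sigmas versus small sigma plus final small sigma
    [ "\x{3A3}\x{3A3}", "\x{3C3}\x{3C2}" ],
);

for my $pair (@pairs) {
    printf "lc says %s, fc says %s\n",
        ( same_lc(@$pair) ? "same" : "different" ),
        ( same_fc(@$pair) ? "same" : "different" );
}
```

In both pairs, lowercasing leaves the strings different while case folding makes them equal, which is the behavior a case-insensitive sort needs.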

Matching Perl identifiers is a lot harder now

In the Learning Perl Student Workbook (first edition), I had an exercise to match a Perl variable with a regular expression. This is supposed to be a simple exercise with a simple answer, so I excluded any special variables, such as $1 or ${^UNICODE}.

A long time ago in a Perl far, far away, when character classes were much smaller (see Know your character classes under different semantics), the pattern seemed simple.

A new Unicode appendix

I’ve added a new appendix to Learning Perl to handle all of the Unicode stuff I was having difficulty integrating into the other chapters.

Our goal has always been to present just the information you need without getting into distracting details. The problem with Unicode is that there are a lot of distracting details. Not only that, you have to learn some things in tandem. We can’t talk about Unicode strings without introducing strings, but at the same time, we want to start with Unicode as the basis for strings.

I wanted to have a lot of that stuff in the Strings chapter, but a lot of the Perl Unicode stuff lives in modules. We do talk about modules later in the book, but I want to use some of them earlier.

Any beginning book is going to have this problem. You need to ignore some stuff to at least get started. As such, I gave up on trying to cram all the Unicode stuff into the chapters and put most of it into a new appendix. This also means that if people want to ignore some of the Unicode stuff, which I don’t recommend, they can. But, they shouldn’t. So, read the whole book, even the appendices!

Regex classes under Unicode

This week in The Effective Perler, I posted about the oddness of character classes. In “Know your character classes under different semantics”, I showed that the trusty character class shortcuts \w, \d, and \s that we know from the first edition aren’t the same thing now. In fact, they haven’t been the same thing since the fourth edition. As I’ve said before, we have basically ignored Unicode despite its support since Perl 5.6. Now we’re paying the Unicode tax; I just have to integrate this into Learning Perl.
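
As a rough illustration of the difference, this sketch tests one non-ASCII character against each shortcut, first with the default semantics and then with the /a modifier that restricts the shortcuts to ASCII. The three sample characters are my own picks, and the sketch assumes Perl v5.14 or later, where /a was introduced:

```perl
use v5.14;    # the /a modifier needs Perl v5.14 or later
use warnings;

# Sample characters chosen for illustration, all outside ASCII
my %samples = (
    '\w' => "\x{3B1}",     # GREEK SMALL LETTER ALPHA
    '\d' => "\x{663}",     # ARABIC-INDIC DIGIT THREE
    '\s' => "\x{2003}",    # EM SPACE
);

for my $class ( sort keys %samples ) {
    my $char = $samples{$class};

    # The shortcut is interpolated into the pattern, so '\d' becomes \d
    my $unicode = $char =~ /\A$class\z/  ? "matches" : "does not match";
    my $ascii   = $char =~ /\A$class\z/a ? "matches" : "does not match";

    printf "U+%04X %s %s by default and %s with /a\n",
        ord($char), $unicode, $class, $ascii;
}
```

Each of these characters matches its shortcut under Unicode rules and fails to match once /a limits the shortcut to the ASCII range.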

unichar, a small Unicode character test program

Re-writing Learning Perl to cover Unicode means I have to figure out how to type some of the characters that don’t show up on my keyboard. Not only that, I need to figure out their character names and code points for the examples. I want to convert from any of those (name, code point, character) to a description of the character. I want something like this:

$ perl unichar ã
Processing ã
		match       grapheme
		code point  U+00E3
		decimal     227
		name        LATIN SMALL LETTER A WITH TILDE
		character   ã

I wrote a short program I called unichar, which I have on GitHub.
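
The core conversions can be sketched with the charnames module that comes with Perl. This is not the code from unichar. The describe subroutine and its hash keys are hypothetical names I made up for the illustration, and it only covers going from a name or a code point to a description:

```perl
use v5.14;
use warnings;
use charnames ();    # use the lookup functions without importing anything

# Hypothetical helper, not taken from unichar: describe one code point
sub describe {
    my ($code) = @_;
    return {
        code_point => sprintf( "U+%04X", $code ),
        decimal    => $code,
        name       => charnames::viacode($code) // '',  # empty if no name is found
        character  => chr($code),
    };
}

# Go from a name to a code point, then describe that code point
my $code = charnames::vianame("LATIN SMALL LETTER A WITH TILDE");
my $info = describe($code);

say "$info->{code_point} ($info->{decimal}) $info->{name}";
```

The sketch prints the code point, the decimal value, and the name, and leaves out the character itself so that it does not need an output encoding layer.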

There are some interesting parts of the script (which might change since I’m still tinkering with it). Even though my locale is set to en_US.UTF-8 and the command-line arguments are UTF-8, Perl still treats them as raw bytes, so I have to decode them myself. The decode subroutine from Encode takes octets in a named encoding and turns them into a Perl character string. Rather than hard-code UTF-8, I ask I18N::Langinfo for the locale’s codeset and decode all the elements of @ARGV with it:

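use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the locale for its character encoding, such as UTF-8
my $codeset = langinfo(CODESET);
# Turn each raw command-line argument into a character string
@ARGV = map { decode $codeset, $_ } @ARGV;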
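
To show what decoding changes, here is a minimal sketch that feeds decode the two bytes that UTF-8 uses for ã. It assumes UTF-8 as the encoding instead of asking the locale, and the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two UTF-8 bytes for U+00E3, as they might arrive in @ARGV
my $octets = "\xC3\xA3";

# Assumes UTF-8 here rather than looking up the locale's codeset
my $string = decode( 'UTF-8', $octets );

# Before decoding Perl sees two separate bytes
printf "before: length %d\n", length($octets);

# After decoding Perl sees one character with the right code point
printf "after:  length %d, U+%04X\n", length($string), ord($string);
```

Before decoding, the length is 2 because Perl counts bytes. After decoding, the length is 1 and the code point is U+00E3, which is what a character lookup needs.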

There are some other interesting bits in there too, but they are a bit advanced for Learning Perl.

Here are some more examples of the output. I handle unprintable and invisible characters specially:

$ perl unichar 䣱
Processing 䣱
		match       grapheme
		code point  U+48F1
		decimal     18673
		name        
		character   䣱

$ perl unichar ↞
Processing ↞
		match       grapheme
		code point  U+219E
		decimal     8606
		name        LEFTWARDS TWO HEADED ARROW
		character   ↞

$ perl unichar U+2057
Processing U+2057
		match       code point
		code point  U+2057
		decimal     8279
		name        QUADRUPLE PRIME
		character   ⁗

$ perl unichar "TAMIL LETTER HA"
Processing TAMIL LETTER HA
		match       name
		code point  U+0BB9
		decimal     3001
		name        TAMIL LETTER HA
		character   ஹ

$ perl unichar 0x05d0
Processing 0x05d0
		match       code point
		code point  U+05D0
		decimal     1488
		name        HEBREW LETTER ALEF
		character   א

$ perl unichar "CYRILLIC CAPITAL LETTER I WITH GRAVE"
Processing CYRILLIC CAPITAL LETTER I WITH GRAVE
		match       name
		code point  U+040D
		decimal     1037
		name        CYRILLIC CAPITAL LETTER I WITH GRAVE
		character   Ѝ

$ perl unichar 0x9
Processing 0x9
		match       code point
		code point  U+0009
		decimal     9
		name        CHARACTER TABULATION
		character   

$ perl unichar 0x07
Processing 0x07
		match       code point
		code point  U+0007
		decimal     7
		name        BELL
		character