Strings and Sorting – Learning Perl

A use for the scalar reverse (maybe)

The reverse operator, which turns a list end to front, has a scalar context too. It’s one of the examples I use in my Learning Perl classes to note that you can’t guess what something does in context. I’ve never had a decent example for a proper use, but flipping a string around to effectively scan from the right seems interesting.
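As a small sketch of that idea (not necessarily the example from the full post), the snippet below adds thousands separators to an integer by working on the reversed string. The commify name is made up for illustration. Reversing the digits lets a left-to-right regex group them in threes starting from what was the right-hand end.

```perl
use strict;
use warnings;

# Hypothetical helper: add thousands separators to an integer by
# working on the reversed string, so the groups start from the right.
sub commify {
	my( $number ) = @_;
	my $reversed = reverse $number;        # scalar assignment, so scalar reverse
	$reversed =~ s/(\d{3})(?=\d)/$1,/g;    # comma after each group of three digits
	return scalar reverse $reversed;       # force scalar context on the way out
	}

print commify( 1234567 ), "\n";          # 1,234,567

# In list context, reverse flips the list, not the string:
print reverse( "hello" ), "\n";          # hello
print scalar( reverse "hello" ), "\n";   # olleh
```

The last two lines show the context trap: the same operator leaves a one-element list unchanged but flips the characters once scalar context is forced.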

Word counting and Zipf’s Law

On the final day of my Learning Perl class, I talk about Zipf’s Law because people now have enough Perl to read a large file, break it up into words, count those words, and sort them by their count.

The final piece of Perl involves sorting a hash by value, which we cover late in the book.
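A minimal sketch of the counting and sorting steps might look like the following. The count_words and by_count names are hypothetical, and the sample text stands in for a large file. Plain lc is good enough for this ASCII sample.

```perl
use strict;
use warnings;

# Hypothetical helper: count the words in a string, ignoring case.
sub count_words {
	my( $text ) = @_;
	my %count;
	$count{ lc $1 }++ while $text =~ /(\w+)/g;
	return \%count;
	}

# Hypothetical helper: return the words sorted by descending count,
# breaking ties alphabetically.
sub by_count {
	my( $count ) = @_;
	return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
	}

my $count = count_words( "the cat and the dog and the bird" );
printf "%-5s %d\n", $_, $count->{$_} for by_count( $count );
```

The sort block compares the hash values rather than the keys, which is the whole trick to sorting a hash by value. Here "the" comes out first with a count of 3, followed by "and" with 2.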

There’s a better (correct) way to case fold

We show you the wrong way to do a case-insensitive sort in Learning Perl. The 6th Edition showed many of Perl’s Unicode features, which we had mostly ignored in all of the previous editions (despite Unicode support starting in Perl v5.6). In our defense, it wasn’t an easy thing to do without CPAN modules before the upcoming Perl v5.16.

In the “Strings and Sorting” chapter, we show this subroutine:

sub case_insensitive { "\L$a" cmp "\L$b" }

In the Unicode world, that doesn’t work (which I explain in Fold cases properly at The Effective Perler). With Perl v5.16, we should use the new fc built-in which does case folding according to Unicode’s rules:

use v5.16; # when it's released
sub case_insensitive { fc($a) cmp fc($b) }

We could use the double-quote case shifter \F to do the same thing:

use v5.16; # when it's released
sub case_insensitive { "\F$a" cmp "\F$b" }

Without Perl v5.16, we could use the Unicode::CaseFold module which defines an fc function.
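To see why lowercasing is not the same as case folding, here is a rough illustration that assumes Perl v5.16 or later. The same_lowercased and same_folded names are made up for this sketch. The German sharp s has no single-character uppercase in everyday spelling, so lc leaves the two spellings different while fc folds both to the same string.

```perl
use v5.16;      # enables strict, say, and the fc feature
use warnings;
use utf8;       # the source contains a literal ß

# Hypothetical helper: compare after lowercasing (the old way)
sub same_lowercased { my( $x, $y ) = @_; return lc($x) eq lc($y) ? 1 : 0 }

# Hypothetical helper: compare after Unicode case folding
sub same_folded { my( $x, $y ) = @_; return fc($x) eq fc($y) ? 1 : 0 }

say same_lowercased( "Straße", "STRASSE" );   # 0
say same_folded(     "Straße", "STRASSE" );   # 1
```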

Updates to Chapter 14, “Strings and sorting”

[This post notes differences between the fifth and sixth editions.]

I did quite a bit of work to update Chapter 14, but most of it isn’t going to show up in that chapter. I initially added a long section on Unicode normalization forms, covering the difference between canonical and compatibility forms. You need to know these to properly sort Unicode strings (see Know your sort orders over at The Effective Perler).

As I went through the explanation, though, I realized that I was also going to need the same concepts for the basic Perl strings, and also for the regular expressions chapters. Even basic comparisons need the idea of equivalence, and the regular expressions even more so (and, there might soon be a /k match flag that will do that for us).
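As a brief sketch of the two kinds of equivalence, the snippet below uses the core Unicode::Normalize module. The canonically_equal and compatibly_equal names are hypothetical. Canonical forms treat a precomposed é and an e followed by a combining accent as the same text, while compatibility forms also erase distinctions such as the ﬁ ligature.

```perl
use strict;
use warnings;
use Unicode::Normalize;   # exports NFD, NFC, NFKD, and NFKC

# Hypothetical helper: same text under canonical equivalence?
sub canonically_equal { my( $x, $y ) = @_; return NFD($x) eq NFD($y) ? 1 : 0 }

# Hypothetical helper: same text under compatibility equivalence?
sub compatibly_equal { my( $x, $y ) = @_; return NFKD($x) eq NFKD($y) ? 1 : 0 }

my $composed   = "\x{E9}";          # é as one code point
my $decomposed = "\x{65}\x{301}";   # e followed by COMBINING ACUTE ACCENT
my $ligature   = "\x{FB01}";        # LATIN SMALL LIGATURE FI

print canonically_equal( $composed, $decomposed ), "\n";   # 1
print canonically_equal( $ligature, "fi" ), "\n";          # 0
print compatibly_equal(  $ligature, "fi" ), "\n";          # 1
```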

While I was writing this chapter, which includes a section on index(), I wondered: since Perl’s string operators work on characters instead of graphemes, could I find accents that way? This material didn’t make it into the book.

use utf8;

my $string = "éáabcáá\x{65}\x{301}í";    

# Start at -1 so that the first search begins at offset 0
my $old_pos = -1;
while( -1 != (my $pos = index $string, "\x{301}", $old_pos + 1 ) ) {
	print "Found accent at $pos\n";
	$old_pos = $pos;
	}

Since I’ve specified only one decomposed é, I get only one match:

Found accent at 8

It doesn’t find the other accents though. I could decompose the string:

use utf8;

use Unicode::Normalize;

my $string = "éáabcáá\x{65}\x{301}í";    

my $decomposed = NFD( $string );

# Start at -1 so that the first search begins at offset 0
my $old_pos = -1;
while( -1 != (my $pos = index $decomposed, "\x{301}", $old_pos + 1 ) ) {
	print "Found accent at $pos\n";
	$old_pos = $pos;
	}

Now I can tell that there are accents, although the positions no longer have much meaning because they don’t relate to the original string:

Found accent at 1
Found accent at 3
Found accent at 8
Found accent at 10
Found accent at 12
Found accent at 14

To get around that, I have to do a lot more work. I can go through each grapheme individually, decompose each one, and look at that:

use 5.012;
use utf8;
binmode STDOUT, ':utf8';

use Unicode::Normalize;

my $string = "éáabcáá\x{65}\x{301}í";    

my @graphemes = $string =~ m/(\X)/g;

while( my( $index, $grapheme ) = each @graphemes ) {
	my $decomposed = NFD( $grapheme );
	print "Found an accent at $index ($grapheme)\n"
		if -1 < index( $decomposed, "\x{301}" );
	}

That's fine, and it reports the right positions for the characters that have accents:

Found an accent at 0 (é)
Found an accent at 1 (á)
Found an accent at 5 (á)
Found an accent at 6 (á)
Found an accent at 7 (é)
Found an accent at 8 (í)

Notice that I use the new form of each on arrays in Perl 5.12.

That's not a very good way to do it, though, because I'm only looking for the ´ mark. I should look for any mark:


use strict;
use warnings;

use utf8;
binmode STDOUT, ':utf8';

use Unicode::Normalize;
	
my $string = "éáåbcüá\x{65}\x{301}í";

my $array = [$string =~ m/(\X)/g];
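# each on an array reference was experimental and was removed in v5.24,
# so dereference the array explicitly
while( my( $index, $grapheme ) = each @$array ) {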
	my $nfd = NFD( $grapheme );
	print "Found an accent at $index ($grapheme)\n"
		if $nfd =~ /\p{Mark}/;
	}

Now I can find all sorts of marks:

Found an accent at 0 (é)
Found an accent at 1 (á)
Found an accent at 2 (å)
Found an accent at 5 (ü)
Found an accent at 6 (á)
Found an accent at 7 (é)
Found an accent at 8 (í)

There are easier ways to do this, but I wanted to stick to just what was in Learning Perl.
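One possibly tidier variation, offered only as a sketch and not as the easier way the post has in mind, collects the positions with grep instead of looping with each. The marked_positions name is hypothetical. It avoids each on arrays, so it does not depend on Perl 5.12.

```perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical helper: return the grapheme positions that carry any mark
sub marked_positions {
	my( $string ) = @_;
	my @graphemes = $string =~ /\X/g;
	return grep { NFD( $graphemes[$_] ) =~ /\p{Mark}/ } 0 .. $#graphemes;
	}

# The same string as above, written with escapes
my $string = "\x{E9}\x{E1}\x{E5}bc\x{FC}\x{E1}\x{65}\x{301}\x{ED}";
print join( ", ", marked_positions( $string ) ), "\n";   # 0, 1, 2, 5, 6, 7, 8
```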