Updates to Chapter 8, “Matching with Regular Expressions”

[This post notes differences between the fifth and sixth editions.]

There are a couple of interesting updates for Chapter 8. The small change is a slight modification to a footnote. We had mentioned that the performance problem of the match variables $& and friends wouldn’t be solved before Perl 6. However, Perl 5.10 introduced the /p match operator flag, along with the ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} variables, so the problem is solved!
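
As a rough illustration of the /p flag, here is a minimal sketch. The sample string and the variable names are made up for this example and are not from the book:

```perl
use 5.010;   # the /p flag and the ${^...} variables arrived in Perl 5.10
use strict;
use warnings;

my $text = "Hello there, neighbor!";   # sample data, made up for this sketch

my( $before, $matched, $after );
if( $text =~ m/there/p ) {
	# With /p, Perl saves the parts of the string for this match
	# in ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}
	( $before, $matched, $after ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
	say "Before:  <$before>";
	say "Matched: <$matched>";
	say "After:   <$after>";
	}
```

The three variables play the same roles as $`, $&, and $', but they are tied to matches that ask for them with /p.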

Chapter 8 also has a subtle shift in thinking about anchors. Perl 5 introduced the \A, \Z, and \z regular expression anchors. Somehow, we never made the shift from the Perl 4 anchors ^ and $. Even after Perl Best Practices pointed out the problem, we failed to update the Llama.
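
To make the difference between the anchors concrete, here is a small sketch. The strings are invented for the example, and it shows only the two behaviors that usually surprise people: $ tolerates a trailing newline, and /m changes what ^ and $ mean:

```perl
use strict;
use warnings;

my $line = "fred\n";   # a line with its newline still attached

# $ allows a trailing newline, and so does \Z
my $dollar = $line =~ /^fred$/   ? 1 : 0;
my $big_z  = $line =~ /\Afred\Z/ ? 1 : 0;

# \z matches only at the absolute end of the string
my $small_z = $line =~ /\Afred\z/ ? 1 : 0;

# With /m, ^ and $ match at the start and end of any line in the string.
# \A and \z ignore /m and still mean the start and end of the string.
my $multi    = "barney\nfred\nwilma\n";
my $with_m   = $multi =~ /^fred$/m   ? 1 : 0;
my $absolute = $multi =~ /\Afred\z/m ? 1 : 0;

print '/^fred$/      : ', $dollar,   "\n";
print '/\Afred\Z/    : ', $big_z,    "\n";
print '/\Afred\z/    : ', $small_z,  "\n";
print '/^fred$/m     : ', $with_m,   "\n";
print '/\Afred\z/m   : ', $absolute, "\n";
```

Only the \z version rejects the string with the newline still on it, and only the \A and \z pair keeps its meaning when /m is present.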

I’d never really bothered to check when Perl introduced \A until today. That’s a task I do quite frequently: when did some feature show up in Perl? I could just go through all the tarballs, unpack them, and look at the documentation, but there’s an easier way. Since I have a clone of the perl repository, I have access to the entire perl development history. Each release has a tag, and I can list all the tags:

$ git tag
perl-1.0
perl-1.0.15
perl-1.0.16
perl-2.0
perl-2.001
perl-3.000
perl-3.044
perl-4.0.00
perl-4.0.36
perl-5.000
perl-5.000o
perl-5.001
perl-5.001n
perl-5.002
perl-5.002_01
perl-5.003
...

If I want to see what was going on in a particular release, I check out the appropriate tag:

$ git checkout perl-5.000

Now I can see the state of the repo at the point of that release. Sure enough, \A, \Z, and \z are in the documentation back then.

Updates to Chapter 3, “Lists and Arrays”

I went into this chapter thinking that it would be fairly easy: just fix up any typos or grammar problems, then move on. However, I was reading through Appendix B and noticed that we had ignored splice in previous editions. We mention it all the way at the end of the book, but it takes almost as much space to say that we aren’t going to cover it as it does to cover it. So, I moved it out of Appendix B and into Chapter 3.
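
For readers who haven't met splice, here is a short sketch of its common forms. The array contents and variable names are placeholders chosen for this example:

```perl
use strict;
use warnings;

# Two arguments: remove everything from index 2 to the end
my @array   = qw( pebbles dino fred barney betty );
my @removed = splice @array, 2;
# @array is now ( pebbles dino ); @removed is ( fred barney betty )

# Three arguments: remove a given number of elements
my @rest   = qw( pebbles dino fred barney betty );
my @middle = splice @rest, 1, 2;
# @rest is ( pebbles barney betty ); @middle is ( dino fred )

# Four arguments: replace the removed elements with a list
my @filled = qw( pebbles dino fred barney betty );
splice @filled, 1, 2, qw( wilma );
# @filled is ( pebbles wilma barney betty )

# A length of 0 inserts without removing anything
my @grown = qw( pebbles dino );
splice @grown, 1, 0, qw( wilma );
# @grown is ( pebbles wilma dino )

print "@array | @removed\n", "@rest | @middle\n", "@filled\n", "@grown\n";
```

In every form, splice changes the array in place and returns the elements it removed.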

You would think that this chapter would be a natural place to pull in things like List::Util, but we actually save that for later. We make some pure-Perl versions of max in the “Subroutines” chapter, then later, in the “Perl Modules” chapter, we can abandon the examples we used to illustrate the Perl concepts so the reader can use List::Util.
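
To show the contrast, here is a sketch of a hand-rolled max next to the one from List::Util. The my_max subroutine is a hypothetical stand-in written for this example, not the code from the book:

```perl
use strict;
use warnings;

use List::Util qw( max );   # List::Util comes with Perl

# A pure-Perl version (hypothetical, for illustration only)
sub my_max {
	my( $max_so_far, @rest ) = @_;
	foreach my $number ( @rest ) {
		$max_so_far = $number if $number > $max_so_far;
		}
	return $max_so_far;
	}

my @numbers = ( 3, 5, 10, 4, 6 );

my $by_hand   = my_max( @numbers );
my $by_module = max( @numbers );

print "By hand: $by_hand\nFrom List::Util: $by_module\n";
```

Writing the subroutine by hand exercises @_, my, and foreach, which is the point in an early chapter. Once modules are on the table, the single import does the same job.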

Updates to Chapter 14, “Strings and Sorting”

I did quite a bit of work to update Chapter 14, but most of it isn’t going to show up in that chapter. I initially added a long section on Unicode normalization forms, covering the difference between canonical and compatibility forms. You need to know these to properly sort Unicode strings (see “Know your sort orders” over at The Effective Perler).

As I went through the explanation, though, I realized that I was also going to need the same concepts for the basic Perl strings, and also for the regular expressions chapters. Even basic comparisons need the idea of equivalence, and regular expressions need it even more (there might soon be a /k match flag that will do that for us).
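
Here is a small sketch of why equivalence matters for plain comparisons, and of the difference between canonical and compatibility forms. It uses Unicode::Normalize, which comes with Perl; the sample characters and variable names are picked for this example:

```perl
use strict;
use warnings;

use Unicode::Normalize qw( NFD NFKD );

my $composed   = "\x{E9}";          # e-acute as one code point
my $decomposed = "\x{65}\x{301}";   # e followed by COMBINING ACUTE ACCENT

# eq compares code points, so the two spellings are different strings
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# After canonical decomposition, they are the same string
my $canonical_equal = NFD( $composed ) eq NFD( $decomposed ) ? 1 : 0;

# Compatibility forms also break apart characters such as ligatures
my $ligature        = "\x{FB01}";   # LATIN SMALL LIGATURE FI
my $canonical_keeps = NFD( $ligature )  eq $ligature ? 1 : 0;
my $compat_splits   = NFKD( $ligature ) eq 'fi'      ? 1 : 0;

print "Raw comparison equal:      $raw_equal\n";
print "Canonical forms equal:     $canonical_equal\n";
print "NFD leaves the ligature:   $canonical_keeps\n";
print "NFKD turns it into 'fi':   $compat_splits\n";
```

The raw comparison fails even though both strings display as the same character, which is why sorting and matching need a normalization step first.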

While I was writing this chapter, which includes a section on index(), I wondered: since Perl’s string operators work on characters instead of graphemes, could I find accents that way? This material didn’t make it into the book.

use utf8;

my $string = "éáabcáá\x{65}\x{301}í";

# Start at -1 so the first search begins at position 0
my $old_pos = -1;
while( -1 != ( my $pos = index $string, "\x{301}", $old_pos + 1 ) ) {
	print "Found accent at $pos\n";
	$old_pos = $pos;
	}

Since I’ve specified only one decomposed é, I get only one match:

Found accent at 8

It doesn’t find the other accents though. I could decompose the string:

use utf8;

use Unicode::Normalize;

my $string = "éáabcáá\x{65}\x{301}í";

my $decomposed = NFD( $string );

# Start at -1 so the first search begins at position 0
my $old_pos = -1;
while( -1 != ( my $pos = index $decomposed, "\x{301}", $old_pos + 1 ) ) {
	print "Found accent at $pos\n";
	$old_pos = $pos;
	}

Now I can tell that there are accents, although the positions no longer have much meaning because they don’t relate to the original string:

Found accent at 1
Found accent at 3
Found accent at 8
Found accent at 10
Found accent at 12
Found accent at 14

To get around that, I have to do a lot more work. I can go through each grapheme individually, decompose each one, and look at that:

use 5.012;
use utf8;
binmode STDOUT, ':utf8';

use Unicode::Normalize;

my $string = "éáabcáá\x{65}\x{301}í";    

my @graphemes = $string =~ m/(\X)/g;

while( my( $index, $grapheme ) = each @graphemes ) {
	my $decomposed = NFD( $grapheme );
	print "Found an accent at $index ($grapheme)\n"
		if -1 < index( $decomposed, "\x{301}" );
	}

That's fine, and it reports the right positions for the characters that have accents:

Found an accent at 0 (é)
Found an accent at 1 (á)
Found an accent at 5 (á)
Found an accent at 6 (á)
Found an accent at 7 (é)
Found an accent at 8 (í)

Notice that I use the new form of each on arrays in Perl 5.12.

That's not a very good way to do it, though, because I'm only looking for the ´ mark. I should look for any mark:


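use 5.012;   # each on arrays needs Perl 5.12; this also enables strict
use warnings;

use utf8;
binmode STDOUT, ':utf8';

use Unicode::Normalize;

my $string = "éáåbcüá\x{65}\x{301}í";

my $array = [$string =~ m/(\X)/g];
# Dereference explicitly: each on a bare array reference was
# experimental and was later removed from Perl
while( my( $index, $grapheme ) = each @$array ) {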
	my $nfd = NFD( $grapheme );
	print "Found an accent at $index ($grapheme)\n"
		if $nfd =~ /\p{Mark}/;
	}

Now I can find all sorts of marks:

Found an accent at 0 (é)
Found an accent at 1 (á)
Found an accent at 2 (å)
Found an accent at 5 (ü)
Found an accent at 6 (á)
Found an accent at 7 (é)
Found an accent at 8 (í)

There are easier ways to do this, but I wanted to stick to just what was in Learning Perl.

Updates to Chapter 4, “Subroutines”

There’s not much to write about Perl subroutines that we haven’t written before, but that doesn’t mean that this chapter gets a pass in the update. This is only Chapter 4, so it’s still early in the book. Up to this point, we have only covered the basics of Perl scalars and arrays. Once we get into subroutines, we start to talk about scoping variables to a block, and it’s here that we introduce lexical variables.

Once we show off my, we can tell people about strict. Still, that’s nothing new. However, the last time we wrote about strict, it was something that you had to enable on your own. Perl 5.12 added the feature that you get it for free by requiring that version of Perl.

Before Perl 5.12:

use 5.010;
use strict;

Starting with Perl 5.12:

use 5.012; # strict for free

We could tell them how to turn it off, but we still won’t do that until Intermediate Perl.
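
As a quick check that the version line really does turn on strict, here is a minimal sketch. The variable names are invented for the example, and the string eval is there only so the program can catch the compile-time error and report it:

```perl
use 5.012;   # asks for at least Perl 5.12 and turns on strict

# A string eval is compiled in the surrounding lexical scope, so strict
# applies to it. Using a variable without declaring it should fail.
my $compiled = eval q{ $undeclared = 1; 1 };
my $error    = $@;

if( $compiled ) {
	say "strict is not in effect";
	}
else {
	say "strict stopped the undeclared variable";
	}
```

Under strict, the undeclared variable is a compile-time error, so the eval returns undef and the second message prints.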