Learning Perl in Works in Progress

Want a sneak peek at Learning Perl, Sixth Edition? Over at The Perl Review, subscribers have access to early versions of the books that I’m working on. I call it “Works in Progress”. It’s available to the sort of people who can help make the books as good as they can be, without being easily available to all the people who will just add the content to their own websites.

Since I’m working on Learning Perl this week, I’ve uploaded the output of O’Reilly’s automated DocBook build system. To be sure, it’s a work in progress. You’ll see a lot of mistakes, unfinished bits, and so on, but feel free to point out any weirdness that you find. I’m especially interested in stuff that’s missing that you think I should cover.

Here’s a sample, which you’ll recognize as O’Reilly Nutshell format:

To follow what’s changing, watch the updates category.

Updates to Chapter 12, “File Test Operators”

[This post notes differences between the fifth and sixth editions.]

This chapter probably doesn’t deserve an update here because almost nothing changed. Most of the updates just make all the code examples consistent. When I added the Perl 5.10 updates for the stacked file test operators, I used a style that wasn’t quite my own, but not quite the one Tom and Randal had already used in the book. It’s more jarring in this chapter than in Chapter 15 (“Smart matching”), a completely new chapter in the fifth edition, because you can see two different styles on the same page. And, I’ve updated Chapter 15 too.
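For readers who haven't met the stacked file test operators yet, here is a minimal sketch of the Perl 5.10 form next to the older spelling. The helper names and the scratch file are made up for illustration; this is not an example from the book.

```perl
use v5.10;    # stacked file tests arrived in Perl 5.10
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: is this a plain file that I can read and write?
sub is_editable_file {
    my ($path) = @_;
    # The tests apply right to left: -f first, then -w, then -r
    my $ok = -r -w -f $path;
    return $ok ? 1 : 0;
}

# The older spelling of the same check, reusing the stat buffer through _
sub is_editable_file_old {
    my ($path) = @_;
    my $ok = -f $path && -w _ && -r _;
    return $ok ? 1 : 0;
}

# A scratch file for the demo, removed when the program exits
my ( $fh, $filename ) = tempfile( UNLINK => 1 );

for my $path ( $filename, '/no/such/file' ) {
    say "$path: new=", is_editable_file($path),
        ' old=', is_editable_file_old($path);
}
```

Both subroutines should agree for any path; the stacked form just saves the repeated `&&` and the `_` filehandle.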

There is one area where I can use some feedback though. We say:

Don’t worry if you don’t know what some of the other file tests mean—if you’ve never heard of them, you won’t be needing them. But if you’re curious, get a good book about programming for Unix.

However, we don’t give any suggestions for what a good book might be. What would you choose?

“captures” versus “memories”, “group” versus “buffer”

The term “memories” to label the side effects of parentheses has fallen out of favor. The new hotness is “capture group”, although that has sometimes shown up as “capture buffer” in the documentation. Karl Williamson, however, purged the docs of “capture buffer”, so you shouldn’t see that anywhere in Perl 5.14’s docs. This mostly affects Chapter 8, where we introduce the match variables, even though we have grouping and backreferences in Chapter 7.

I’m not so sure I like “groups” everywhere though. I think that’s the right term to apply to the particular parentheses that triggered the capture, but not necessarily the thing actually captured. It’s the difference between asking which team is in the Super Bowl and who’s on the Super Bowl team.
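To make the distinction concrete, here is a small sketch of my own (the subroutine names and the sample string are made up). The groups are the parentheses in the pattern; the captures are the text that ends up in the match variables.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper: the pattern has two capture groups (the two pairs
# of parentheses); $1 and $2 hold the text those groups captured
sub split_name {
    my ($string) = @_;
    return unless $string =~ /(\w+)\s+(\w+)/;
    return ( $1, $2 );
}

# Perl 5.10 also lets a group carry a name; the captured text lands in %+
sub split_name_named {
    my ($string) = @_;
    return unless $string =~ /(?<first>\w+)\s+(?<last>\w+)/;
    return ( $+{first}, $+{last} );
}

my ( $first, $last ) = split_name('Fred Flintstone');
say "Group 1 captured: $first";
say "Group 2 captured: $last";
```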

I don’t really care that much, though, because there’s one overriding concern: we need to use the same terms that are in the documentation so people have the right search terms.

Perl 6 has a thing called captures, but that’s a completely different beast.

Regex classes under Unicode

This week in The Effective Perler, I posted about the oddness of character classes. In “Know your character classes under different semantics”, I showed that the trusty character class shortcuts \w, \d, and \s that we know from the first edition aren’t the same thing now. In fact, they haven’t been the same thing since the fourth edition. As I’ve said before, we have basically ignored Unicode despite its support since Perl 5.6. Now we’re paying the Unicode tax; I just have to integrate this into Learning Perl.
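Here is a quick sketch of the sort of surprise I mean, assuming Perl 5.14 or later for the /a modifier. The subroutine names are hypothetical, and I picked DEVANAGARI DIGIT ONE only because it is a digit that isn't one of 0 through 9.

```perl
use v5.14;    # the /a modifier arrived in Perl 5.14
use strict;
use warnings;

# Hypothetical helper: does \d match under the default Unicode rules?
sub is_digit {
    my ($char) = @_;
    return $char =~ /\A\d\z/ ? 1 : 0;
}

# Hypothetical helper: with /a, \d is restricted to the ASCII digits 0-9
sub is_ascii_digit {
    my ($char) = @_;
    return $char =~ /\A\d\z/a ? 1 : 0;
}

my $digit = "\x{0967}";    # DEVANAGARI DIGIT ONE

say 'Unicode semantics: ', is_digit($digit)       ? 'match' : 'no match';
say 'ASCII semantics:   ', is_ascii_digit($digit) ? 'match' : 'no match';
```

The first line should report a match and the second should not, which is exactly the difference a first-edition reader would never have had to think about.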

unichar, a small Unicode character test program

Re-writing Learning Perl to cover Unicode means I have to figure out how to type some of the characters that don’t show up on my keyboard. Not only that, I need to figure out their character names and code points for the examples. I want to convert from any of those (name, code point, character) to a description of the character. I want something like this:

$ perl unichar ã
Processing ã
		match       grapheme
		code point  U+00E3
		decimal     227
		name        LATIN SMALL LETTER A WITH TILDE
		character   ã

I wrote a short program I called unichar, which I have on GitHub.
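The core of that sort of conversion can be sketched with the charnames module that comes with Perl. This is not the unichar source, just a minimal illustration; the to_code_point and describe subroutines are names I made up here, and the sketch skips the special handling that a real tool needs.

```perl
use v5.10;
use strict;
use warnings;
use charnames ();    # load the name lookups without importing anything

# Hypothetical helper: accept a code point (U+XXXX or 0xXXXX), a single
# character, or a character name, and return the code point as a number
sub to_code_point {
    my ($input) = @_;
    return hex $1     if $input =~ /\A(?:U\+|0x)([0-9A-F]+)\z/i;
    return ord $input if length($input) == 1;
    return charnames::vianame( uc $input );    # undef for an unknown name
}

# Hypothetical helper: build the fields that the report shows
sub describe {
    my ($input) = @_;
    my $code = to_code_point($input);
    return unless defined $code;
    return {
        code_point => sprintf( 'U+%04X', $code ),
        decimal    => $code,
        name       => charnames::viacode($code) // '',
    };
}

for my $input ( 'U+2057', '0x05d0', 'TAMIL LETTER HA' ) {
    my $info = describe($input) or next;
    say join "\t", $input, @{$info}{qw(code_point decimal name)};
}
```

The sketch never prints the character itself, which sidesteps the output encoding questions that the real program has to deal with.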

There are some interesting parts of the script (which might change since I’m still tinkering with it). Even though my locale is set to en_US.UTF-8 and the command-line arguments are UTF-8, the script still sees them as raw bytes, so I have to decode them. The decode subroutine from Encode takes those bytes and turns them into a Perl character string. In this case, I ask the locale for its codeset and do that for all the elements of @ARGV:

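use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the locale which encoding it uses, such as UTF-8
my $codeset = langinfo(CODESET);

# Turn each raw command-line argument into a character string
@ARGV = map { decode $codeset, $_ } @ARGV;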

There are some other interesting bits in there too, but they are a bit advanced for Learning Perl.
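To see why the decoding step matters, here is a small sketch that hard-codes the two UTF-8 bytes for ã instead of reading them from the command line. The decode_args subroutine is a hypothetical wrapper for the demo, and the encoding name is passed in directly where the real script would ask the locale.

```perl
use v5.10;
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode a list of raw arguments with one encoding
sub decode_args {
    my ( $codeset, @args ) = @_;
    return map { decode( $codeset, $_ ) } @args;
}

# The two bytes that encode U+00E3 in UTF-8, as a shell would pass them
my $raw = "\xC3\xA3";
my ($chars) = decode_args( 'UTF-8', $raw );

say 'before decode: length ', length($raw);
say 'after decode:  length ', length($chars);
printf "code point:    U+%04X\n", ord $chars;
```

Before decoding, Perl counts two separate bytes; after decoding, it sees the single character U+00E3, which is what the rest of the program needs.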

Here are some more examples of the output. I handle unprintable and invisible characters specially:

$ perl unichar 䣱
Processing 䣱
		match       grapheme
		code point  U+48F1
		decimal     18673
		name        
		character   䣱

$ perl unichar ↞
Processing ↞
		match       grapheme
		code point  U+219E
		decimal     8606
		name        LEFTWARDS TWO HEADED ARROW
		character   ↞

$ perl unichar U+2057
Processing U+2057
		match       code point
		code point  U+2057
		decimal     8279
		name        QUADRUPLE PRIME
		character   ⁗

$ perl unichar "TAMIL LETTER HA"
Processing TAMIL LETTER HA
		match       name
		code point  U+0BB9
		decimal     3001
		name        TAMIL LETTER HA
		character   ஹ

$ perl unichar 0x05d0
Processing 0x05d0
		match       code point
		code point  U+05D0
		decimal     1488
		name        HEBREW LETTER ALEF
		character   א

$ perl unichar "CYRILLIC CAPITAL LETTER I WITH GRAVE"
Processing CYRILLIC CAPITAL LETTER I WITH GRAVE
		match       name
		code point  U+040D
		decimal     1037
		name        CYRILLIC CAPITAL LETTER I WITH GRAVE
		character   Ѝ

$ perl unichar 0x9
Processing 0x9
		match       code point
		code point  U+0009
		decimal     9
		name        CHARACTER TABULATION
		character   

$ perl unichar 0x07
Processing 0x07
		match       code point
		code point  U+0007
		decimal     7
		name        BELL
		character