Learning Perl Challenge: Find duplicates

This is the second novice challenge. I’ll give you a problem that you should be able to solve with just the stuff we present in Learning Perl (including using modules, so most of Perl). A week or so later, I’ll post a solution.

For this one, given a single directory that may contain duplicated files, find the files that might be duplicates. You only need to find the possible duplicates and print their names. If you want to remove the duplicated files, make sure that you have a backup first!

There are some modules that might be helpful.

If you are especially motivated, also search through any subdirectories that you find.
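
Just to sketch the shape of one possible approach (this isn’t the official answer, and it assumes that hashing each file’s contents with the core Digest::MD5 module is good enough to flag likely duplicates):

use v5.10;
use File::Spec;
use Digest::MD5 qw(md5_hex);

my $dir = shift // '.';
my %seen;   # digest => the files whose contents hashed to it

opendir my $dh, $dir or die "Could not read $dir: $!\n";
foreach my $file ( map { File::Spec->catfile( $dir, $_ ) } readdir $dh ) {
	next unless -f $file;   # skips . and .. and anything else that isn't a plain file
	open my $fh, '<', $file or next;
	binmode $fh;
	push @{ $seen{ md5_hex( do { local $/; <$fh> } ) } }, $file;
	}

foreach my $digest ( keys %seen ) {
	next unless @{ $seen{$digest} } > 1;   # only report digests shared by two or more files
	say join "\n", @{ $seen{$digest} }, '';
	}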

You can see a list of all the Challenges and my summaries, as well as the programs that I created, in the Learning Perl Challenges GitHub repository.

Can you learn Perl from an old Learning Perl?

A question at Yahoo! Answers asks “Is it okay to learn perl with an out-dated book?”. This showed up in my Google Alerts through a new link-attractor site that wants to get referral and ad revenue. It’s a four-year-old question, which puts it in the desert between Perl 5.8 and Perl 5.10.

I’ve seen this question several times, and I always wonder why people want to settle for such an old book. This questioner got it from a thrift store for a penny, but most people don’t say why. Is it hard to get books where they are? Did they get it cheap in a used book store, or find it in the trash? As someone who wants to make money writing about Perl, I wonder why people either don’t have the opportunity or are reluctant to pay $20 to $40 to get the latest information, with the old information usually explained better. I fully realize that prices and availability outside the United States can be a problem. If you can’t get the print book, you might be able to get the ebook version, but you need a credit card to complete the purchase from either the O’Reilly website or Google Books. As with my answers to technical questions, I’d like to know why the old-book situation exists, but people often don’t tell us why they want to settle for an ancient edition.

Before I give my answer, you should remember that I get royalties from Learning Perl, from the fourth edition on, and with increasing participation in each subsequent edition, I get a greater share. It’s in my interest that everyone gets the latest version of this book.

The reader’s interest is usually learning enough Perl to use the Perl they have available to them. Unless someone is turning on a particular computer for the first time in 10 years and wants to stick with that ancient version of the operating system, I’d expect the earliest Perl they might have is something in the 5.8 series.

The questioner specifically asks about a book that covers Perl 5.004. That’s a version from 1997 to 1998, depending on the subversion you use. If the book is Learning Perl, that’s probably the second edition, released in the summer of 1997. That’s 15 years ago.

If your book is pink, it’s too old

I don’t think that you need to learn Perl from the latest book. Learning Perl, Sixth Edition covers up to Perl 5.14, but if you aren’t using Perl 5.14, you probably don’t need that edition. You might get by with Learning Perl, Fifth Edition, which covers up to Perl 5.10. If you aren’t using Perl 5.10, which is the latest unsupported version, at least for another couple of months, you might get by with Learning Perl, Fourth Edition, which covers up through Perl 5.8. Even though the perl developers now release a new version of Perl every year, Perl users upgrade much more slowly, especially since each new version since Perl 5.10 has few, if any, compelling features for most people.

The further back in time you go, however, the more current practice you miss. When we update a book, we also roll in the changes to style and idiom that develop as practicing Perl programmers learn more about the consequences of the old idioms. For instance, the three-argument open was introduced in Perl 5.6, but wasn’t stressed by the community until much later. Along with that, Perl 5.6 introduced autovivified scalar filehandles. I mentioned some of these in Why we teach bareword file handles. This version of Perl was a big change in Perl practice, and as such, Perl 5.6, released in 2000, is the earliest version that I think you should even consider learning. That means that Learning Perl, Third Edition, a version in which I have no financial stake, is the minimum edition you should use.
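
To make that idiom shift concrete, here’s the sort of change I mean, with the bareword, two-argument open of the older books next to the lexical, three-argument open that Perl 5.6 made possible ($logfile is just a placeholder; this isn’t an excerpt from any edition):

# old style: package-global bareword filehandle, two-argument open
open LOG, ">> $logfile" or die "Could not open $logfile: $!";

# Perl 5.6 and later: autovivified lexical filehandle, three-argument open
open my $log_fh, '>>', $logfile or die "Could not open $logfile: $!";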

Most of the things that you’ll learn from earlier books are still perfectly good Perl, as far as the latest compiler is concerned. The perl developers have been very keen on supporting old Perl scripts, so the code you wrote in 1995 still works with the latest perl. Very few features (such as pseudohashes) have ever disappeared.

There’s another point that comes up in one of the answers. The “best answer” says that it takes about a year to write a book and that by the time it hits the shelves, it’s already out of date. Neither of those things is true for Learning Perl. We do a new edition in about six months, and we plan it according to the next version of perl by following the development of the experimental version. By the time the book hits the shelf, we’re usually right on time for the latest version of Perl. For Programming Perl, Fourth Edition, we’re even a little ahead since we cover some Perl 5.16 features, and that’s not even out yet.

There’s a better (correct) way to case fold

We show you the wrong way to do a case-insensitive sort in Learning Perl. The 6th Edition showed many of Perl’s Unicode features, which we had mostly ignored in all of the previous editions (despite Unicode support starting in Perl v5.6). In our defense, it wasn’t an easy thing to do without CPAN modules before the upcoming Perl v5.16.

In the “Strings and Sorting” chapter, we show this subroutine:

sub case_insensitive { "\L$a" cmp "\L$b" }

In the Unicode world, that doesn’t work (which I explain in Fold cases properly at The Effective Perler). With Perl v5.16, we should use the new fc built-in which does case folding according to Unicode’s rules:

use v5.16; # when it's released
sub case_insensitive { fc($a) cmp fc($b) }

We could use the double-quote case shifter \F to do the same thing:

use v5.16; # when it's released
sub case_insensitive { "\F$a" cmp "\F$b" }

Without Perl v5.16, we could use the Unicode::CaseFold module from CPAN, which defines an fc function.
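
A minimal sketch with that module (assuming Unicode::CaseFold is installed from CPAN) looks just like the v5.16 version, but gets fc from the module instead:

use Unicode::CaseFold;   # supplies fc() on perls before v5.16

sub case_insensitive { fc($a) cmp fc($b) }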

Learning Perl Challenge: rhich (Answer)

Here’s my answer for the first Learning Perl Challenge, where I asked people to create a program like `which`, but one that takes a pattern as an argument.

This is a program I had already written. As I recall, I wrote it at the 2011 Perl QA Workshop in Amsterdam. I’d wanted the program for a long time and finally broke down and wrote it.

My answer (I give myself a B+)

Here’s the program I wrote. I didn’t clean this up or otherwise hide any ugliness. That I did something in any particular way isn’t an endorsement for that technique. This is something I spent five minutes on, and, once it worked, stopped thinking about it:

use 5.012;

use File::Spec;

my $regex = $ARGV[0];

my @paths = get_path_components();

foreach my $path ( @paths ) {
	if( ! -e $path ) {
		warn "path $path does not exist\n";
		next;
		}
	elsif( ! -d $path ) {
		warn "path $path is not a directory\n";
		next;
		}
	elsif( opendir my $dh, $path ) {
		my @commands =
			map     {
				if( -l ) {
					my $target = readlink;
					"$_ -> $target";
					}
				else { $_ }
				}
			grep    { -x }
			map     { File::Spec->catfile( $path, $_ ) } 
			grep    { /$regex/ } 
			readdir $dh;
		
		next unless @commands;

		print join "\n", @commands, '';
		}
	else {
		warn "Could not read directory for $path: $!\n";		
		}
	}
	
sub get_path_components {
	use Config;
	my $separator = $Config{path_sep} // ':'; #/
	my @parts = split /$separator/, $ENV{PATH};
	}

The heart of the program is the last elsif, which is a long list pipeline:

			map     {
				if( -l ) {
					my $target = readlink;
					"$_ -> $target";
					}
				else { $_ }
				}
			grep    { -x }
			map     { File::Spec->catfile( $path, $_ ) } 
			grep    { /$regex/ } 
			readdir $dh;

Read these from the bottom up. I use readdir to get the list of files in the current path and filter that using the regex. Notice that I have a problem: I never checked that pattern for validity. I should have done that earlier.

The bottom map puts that filename together with the path using File::Spec, and the grep right above it checks for paths that are executable (which wasn’t an explicit requirement).

The top map looks for symbolic links so it can modify the string to show the link target. That wasn’t part of the challenge either.

My output looks like:

% rhich perl
/usr/local/bin/perl-5.10.0 -> /usr/local/perls/perl-5.10.0/bin/perl
/usr/local/bin/perl-5.6.2 -> /usr/local/perls/perl-5.6.2/bin/perl
/usr/local/bin/perl-5.8.9 -> /usr/local/perls/perl-5.8.9/bin/perl
/usr/local/bin/perl5.10.1 -> /usr/local/perls/perl-5.10.1/bin/perl5.10.1
/usr/bin/cpanp-run-perl
/usr/bin/cpanp-run-perl5.10
/usr/bin/cpanp-run-perl5.12

Answers from others

I won’t reproduce all of the answers here since you can see them in the comments for the original post.

Anonymous Coward, score: B+

The first response, from Anonymous Coward, fixes one thing I did not do. This person checks that the file is not a directory, which can also be executable:

      if ($file =~ m/\Q$ARGV[0]\E/) {
        say STDOUT "$path/$file" if -x "$path/$file" and not -d _;
      }

However, this solution fails because the pattern is nerfed by the \Q, which literalizes all of the regex metacharacters. Patterns such as perl[56] won’t work unless that’s a literal substring in the program name. Besides that small mistake, the program is fine.
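
A quick way to see the problem (my own demonstration, not code from the submission):

use v5.10;

# \Q...\E makes the brackets literal, so this never matches 'perl5' or 'perl6':
say 'no match' unless 'perl5' =~ /\Qperl[56]\E/;

# without \Q, the character class works the way the user intended:
say 'match'    if     'perl5' =~ /perl[56]/;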

Scott

Scott’s answer is fine for a Perl beginner, as he admits. He does something I hadn’t thought to do by disallowing multiple arguments:

die 'Usage: rhich [filename]' if @ARGV != 1;

It might be interesting to have multiple patterns, which are sometimes easier than one complicated pattern. I would have added a newline to the end of that die string so I don’t have to see the file and line number information.
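
That is, something like this (my tweak, not Scott’s code), where the trailing newline suppresses the “at ... line ...” suffix:

die "Usage: rhich [filename]\n" if @ARGV != 1;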

He could use File::Spec, but that’s not a huge deal if the structure of the program is right:

      print $path."/".$_,"\n";  #Prints out the FILENAME and PATH

Jack Maney, score: B

Jack added several features to his solution. He supports the -a switch so his solution acts more like which, which normally stops at the first match.

use Getopt::Long qw(:config pass_through);

my $all=0;

GetOptions('a'=>\$all);

He sets up a hash to store his finds:

my %output;
$output{$_}=[] foreach(@ARGV);

The meat of the program is in this little bit. If the path matches, it goes into the hash:

        foreach my $pattern (@patterns_to_check)
        {
            if($file=~/$pattern/)
            {
                push @{$output{$pattern}},$dir . $dir_sep . $file;
                unless($all)
                {
                    @patterns_to_check=grep{$_ ne $pattern}@patterns_to_check;
                }
            }
        }

There are a couple of opportunities for improvement here. First, the $dir . $dir_sep . $file is fragile. It assumes more about the filesystem than we really know. Simply gluing strings together with a separator might not always work. I don’t think that many people are using VMS though, so maybe it doesn’t matter.

The grep confused me for a bit until I realized what it was doing (a comment might be nice). It takes out the pattern if we didn’t use -a to get all the matches. I still feel a bit weird about that: I don’t like changing aggregates as I’m iterating through them. I think I might have turned the structure inside out to have the patterns on the outside and the files on the inside, then used last to skip the rest of the files once I found one.
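
Something like this sketch is what I have in mind, ignoring the directory loop and assuming @patterns holds the command-line patterns and @files the candidate paths:

foreach my $pattern ( @patterns ) {
	foreach my $file ( @files ) {
		if( $file =~ /$pattern/ ) {
			push @{ $output{$pattern} }, $file;
			last unless $all;   # without -a, skip the rest of the files for this pattern
			}
		}
	}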

Meir, score: B-

While most solutions checked only the filename for a match, Meir’s answer matches against the entire path:

	foreach my $i (glob "$d/*")
	{
		print "$i\n" if ( $i =~ /${fl}/i );
	}

This means it might match the directory name. For instance, I have a ~bin/perls directory in my path, so if I search for “perl”, all programs in that directory, including things like xsubpp and splain, match. The problem there is the glob, whose return value includes the path according to the string argument. Give it an absolute or partial path, and those parts show up in its return value. If you want to keep using glob, you can chdir first or strip off the directory portion. Neither of those is that attractive. Other than that minor problem, it’s the same basic structure as most answers.
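
Stripping off the directory could look something like this (a sketch based on Meir’s loop, using the core File::Basename module):

use File::Basename qw(basename);

foreach my $i ( glob "$d/*" ) {
	# match against the filename only, not the directory part that glob included
	print "$i\n" if basename($i) =~ /${fl}/i;
	}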

Tudor Constantin, score: B

Tudor’s answer has the same problem as Meir’s, which he used as the basis for his answer, but he has a lot more going on in the list he gives to foreach:

$_ =~ /${fl}/i and print("$_\n") foreach ( map { glob( "$_/*" or '.' ) } split( $Config{path_sep},$ENV{PATH} ) ) ;

Kevin Biskar, score: B+

Kevin’s answer is like the previous two glob answers, but he fixes the problem with chdir. He didn’t follow the hints about using Config for the path separator, though, and he constructs paths on his own:

my $pattern = \$ARGV[0];

foreach my $dir (split /:/, $ENV{PATH}) {
	chdir $dir;
	foreach my $file (glob '*') {
		if ($file =~ /$$pattern/) {
			print "$dir/$file\n"
		}
	}
}

The reference and dereference aren’t doing much there, but they aren’t really doing any harm either.

Although this program isn’t portable, it works on my system. The problems aren’t in the structure and are easy to fix, which is often a fine way to develop: you don’t paint yourself into any corners, and the problems are local instead of global.

The Challenge winner

It was tough to pick a winner here. I really liked Anonymous Coward’s solution, but the program has a serious bug since it can’t actually search patterns. Kevin Biskar’s solution is good, but he missed some of the portability hints in the challenge. If I had to choose, I’d have to go with Kevin since his program works.

However, everyone who submitted is already ahead of the pack. I’m sure there are many people who never submitted their solutions, and many more people who never developed their own solution. It’s a tough thing to let everyone see what you tried.

The Aftermath

Taking some hints from everyone’s solutions, I modified my program:

#!/usr/local/perls/perl-5.14.2/bin/perl

use v5.12;
use strict;
use warnings;

use File::Spec;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;

while ( my( $index, $arg ) = each @ARGV ) {
	my $re = eval { qr/$arg/ };
	die "The pattern [$arg] at position $index is not a valid perl regex\n\t$@" if $@;
	$ra->add( $re );
	}
	
my $regex = $ra->re;
 
my @paths = get_path_components();

foreach my $path ( @paths ) {
	if( ! -e $path ) {
		warn "path $path does not exist\n";
		next;
		}
	elsif( ! -d $path ) {
		warn "path $path is not a directory\n";
		next;
		}
	elsif( opendir my $dh, $path ) {
		my @commands =
			map     {
				if( -l ) {
					my $target = readlink;
					"$_ → $target";
					}
				else { $_ }
				}
			grep    { -x -f and ! -d _ }
			map     { File::Spec->catfile( $path, $_ ) } 
			grep    { /$regex/ } 
			readdir $dh;
		
		next unless @commands;

		print join "\n", @commands, '';
		}
	else {
		warn "Could not read directory for $path: $!\n";		
		}
	}
	
sub get_path_components {
	use Config;
	my $separator = $Config{path_sep} // ':';
	my @parts = split /$separator/, $ENV{PATH};
	}

I use Regexp::Assemble to build up a single pattern based on all the patterns on the command line. Before I add each pattern, though, I compile it with qr// inside an eval to catch invalid patterns. To go through the command-line patterns, I use the Perl v5.12 version of each on @ARGV so I can get the position of each element along with its value:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;

while ( my( $index, $arg ) = each @ARGV ) {
	my $re = eval { qr/$arg/ };
	die "The pattern [$arg] at position $index is not a valid perl regex\n\t$@" if $@;
	$ra->add( $re );
	}
	
my $regex = $ra->re;

Later, when I’m filtering paths, I check that I have a plain file and that it’s not a directory:

			grep    { -x -f and ! -d _ }

This might look a bit weird, but it’s using the stacked file test operators from Perl v5.10. In the stacked -x -f, the rightmost test operates on its default argument, $_, and the other reuses the data from that stat. The ! -d uses the virtual filehandle _ that re-uses the fetched data from the previous file test.
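
If it helps, that stacked grep unrolls to roughly this:

			grep    { -f $_ and -x _ and ! -d _ }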

I didn’t bother supporting -a, since my program already shows all of the matches. I’ve never been that curious about only the first result when I’m searching by a pattern.