Learning Perl Challenge: finding duplicates (Answer)

March’s challenge was to find duplicate files. I often write a quick and dirty program to do this, having forgotten what I called the program or having failed to copy it to all of my accounts. This program helps me find extra copies of files that are taking up my space on Dropbox. Did I really mean to have two copies of the OS X Lion installer? I had already downloaded it, moved it into the wrong folder, and forgotten about it. I downloaded it again, moved it into the right folder, and took up another 2% of my Dropbox space. Or, iTunes stupidly stores the same file twice, in the same folder even, with slightly different names (and that’s in Dropbox too).

Before I go through the posted solutions for the challenge, I’ll show you mine. It’s not at all impressive and not that efficient. It’s not even pretty, but this is the program I hacked up a long time ago to do this. I spend five minutes thinking about it, let it run, and moved on:

use File::Find;
use Digest::MD5;

find( \&wanted, grep { -e } @ARGV );

our %digests;
sub wanted {
	return if( -d $File::Find::name or -l $File::Find::name );
	my $fh;
	unless( open $fh, '<', $File::Find::name ) {
		warn "$File::Find::name: $!";
		return;
		}

	my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
	
	$digests{$digest} .= "$File::Find::name\000";
	}

foreach my $digest ( keys %digests ) {
	my @files = split /\000/, $digests{$digest};
	next unless @files > 1;
	print join( "\n\t", @files ), "\n\n";
	}

As some people noted in their solutions, various digests have problems. We want a digest that is a short representation of content and is unlikely to be the same for different content. The chances of a “collision” depend on the algorithm as well as the number of times we try it. If we only digest one file, we never have a collision. If digest ten files, we expect a collision to be extremely rare. When (not if) we get to billions of files, weak algorithms will fail us. Since I don’t care that much, I use MD5. But, I’ve segregated all of those details in a subroutine, so I can easily change that.

Aside from that, this is essentially an exercise in file system traversal and hashes. However I decide to compare files, I need to store earlier results. I limit myself to the concepts in Learning Perl, so I don’t use an array reference as the value in %digests. Instead, I store all filenames in a single string by separating them with a null byte (\000). Later,
I split them again to get the list of files.

Since I’m writing a simple program, I merely report the duplicates as I find them. I want to inspect the files myself before I delete or move them.

My solution has problems. I have to wait for find to do its work before I start seeing a report of the duplicates. That’s rather annoying. However, when I use this, I usually don’t care how fast I get the results. It is a bit of ugly programming, but it works.

The solutions

Although people have their particular coding styles, there are three ways these programs can be conceptually different:

Finding the files
Digesting the files
Reporting the duplicates

Finding files

There are a few ways to find the files. The easiest is File::Find, which does a depth-first search by using a callback. It’s the easiest thing to start with, and many people did. It’s what I used.

Tudor makes his own recursive subroutine to get all the files out of a directory and its subdirectories, effectively reinventing File::Find:

sub _process_folder{
    my $folder_name = shift;

    foreach my $file_name ( glob  "$folder_name/*" ){
        #nothing to do if current, or parent directory
        next if $file_name ~~ [ qw/. ../ ];

        # $file_name might actually be a folder
        if ( -d $file_name ){
            _process_folder($file_name);
            next ;
        };

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };
    }

    return;
}

Mr. Nicholas uses a glob to get all the files in the current working directory:

while (<*>) {
    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;
}

I’d rather use glob if I wanted a list of files so I can save the angle brackets for reading lines. That’s what ulric did:

my @files=glob'*';

Eric used readdir, and gets all the files at once:

opendir DIR, $ARGV[0]
    or die "opendir error: $!";
my @files = readdir DIR;
my %fp;

The other solutions that don’t use File::Find only work with the current directory, which is what I specified for the basic parts of this problem. I’ll have to give no points to those solutions for this part—neither extra points or demerits. They pass, while other solutions did the extra parts to handle subdirectories.

Digesting files

On his first go, Anonymous Coward went for Digest, which handles many types of digests with the same interface. He went with SHA-256. On his first try, he hardcoded the digest and used a variable name tied to that particular digest:

    my $sha256 = Digest->new("SHA-256");
    if (open my $handle, $name) {
      $sha256->addfile($handle);
      my $digest = $sha256->hexdigest;

That hexdigest call is important, and one of the issues that I usually forget. The digest by itself is just a big number. To turn it into what I usually expect, a string of hexadecimal digits, I call the right method. The digest method returns a binary string. I could have used b64digest to get a Base-64 encoded version.

On his next go around, he moved those details higher in the program, making way for an eventual configuration outside the program:

my $algorithm = 'SHA-256';

        my $digest = Digest->new($algorithm);

Most people reached directly for either Digest::MD5 or Digest::SHA1, both which export functions for these:

Jack Maney put his digest details in a subroutine, which compartmentalizes them in a part the rest of the program doesn’t need to know about. If he wants to change the digest, he changes the subroutine:

sub compare_files
{
    my ($file1,$file2)=@_;
 
    if($line_by_line) #using File::Compare::compare
    {
        my $ret_val=eval{compare($file1,$file2)};
 
        die "File::Compare::compare encountered an error: " . $@ if $@;
 
        return 1 if $ret_val==0; #compare() returns 0 if the files are the same...
 
        return undef;
    }
    else #Otherwise, we use Digest::SHA1.
    {
        open(my $fh1,"< ",$file1) or die $!;
        open(my $fh2,"<",$file2) or die $!;
 
        my $sha1=Digest::SHA1->new;
 
        $sha1->addfile($fh1); #Reads file.
        my $hex1=$sha1->hexdigest; #40 byte hex string.
 
        $sha1->reset;
        $sha1->addfile($fh2);
        my $hex2=$sha1->hexdigest;
 
        close($fh1);
        close($fh2);
 
        return $hex1 eq $hex2;
    }
}

Tudor went for Digest::MD5, which exports md5_hex. I don’t particularly like this function because I always think it should take a filename. Tudor uses File::Slurp‘s read_file Instead I have to give it the actual file contents. Tudor has a bit of a problem because he does the computation twice for each file:

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };

Ulrich does the same thing, but does the computation once and stores it in a variable:

  my $digest=md5_hex();
  if (exists $filehashes{$digest}) {
    $dupfree=0;
    push @{$filehashes{$digest}}, $file;
    print "duplicates detected: ";
    foreach $file (@{$filehashes{$digest}}) {
      print "$file  ";
    }

Instead of Digest::MD5, Gustavo used Digest::SHA1, which has a similar interface . He wraps it all up in a single statement:

    push @{$sha2files{sha1_hex(read_file($file))}}, $file;

Mr. Nicholas did the same with Digest::MD5:

    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;

Javier used the object form of Digest::MD5 that takes a filehandle. He has a subroutine that just returns the digest:

sub get_hash($)
{
   open(FILE, $_);
   return Digest::MD5->new->addfile(*FILE)->hexdigest;
}

Eric also used the object form, but used the module directly in the statement:

    $fp{$_} = Digest::MD5->new->addfile(*FILE)->hexdigest;

The winner for this part of the problem would have to be a tie between Anonymous Coward setting up a flexible digest system and Javier for creating a short subroutine. Anonymous Coward comes out slightly ahead I think.

Finding duplicates

Finding the duplicates is the final part of the problem. Most answers did what Anonymous Coward did. When he computed a digest, he used it as the key in a hash, and made the value an array reference. In his first go, he uses the v5.14 feature that automatically dereferences the first argument to push:

      if (exists $duplicates{$digest}) {
        push $duplicates{$digest}, $name;
      } else {
        $duplicates{$digest} = [$name];
      }

Jack Maney used Set::Scalar to keep track of duplicates. He goes through the list of files and compares them with every other file in a double foreach loop, which is a lot of work. If the two files are the same, he looks through all the sets he’s stored so far looking for a set that already contains one of the filenames so he can add the new filename:

foreach my $file1(@files)
{
	foreach my $file2(@files)
	{
		next if $file1 eq $file2; #only comparing distinct pairs of files!

		if(compare_files($file1,$file2)) #If they're the same...
		{
			#first, see if $file1 is in any element of @duplicates.
			my $found=0; #flag to see if we found $file1 or $file2

			foreach my $set (@duplicates)
			{
				if($set->has($file1))
				{
					$set->insert($file2);
					$found=1;
					last;
				}
				elsif($set->has($file2))
				{
					$set->insert($file1);
					$found=1;
					last;
				}
			}

			unless($found) #If we didn't find $file1 or $file2 in @duplicates, add a new set!
			{
				push @duplicates,Set::Scalar->new($file1,$file2);
			}
		}
	}
}

There are some good ideas there, but I’d have to revert to references to improve it by keeping the sets as values in a hash where the key is the digest. However, I’m limiting myself to whatever we have in Learning Perl for my official solution. For my unofficial solution, I would have made a single pass over @files to digest it and another pass over %digests to report them:

foreach my $file ( @files ) {
	my $digest = get_digest( $file );
	$digests{$digest} = Set::Scalar->new 
		unless defined $digests{$digest};
	$digests{$digest}->insert( $file );
	}
	
foreach my $digest ( keys %digests ) {
	next unless $digests{$digest}->size > 1;
	my $dupes = $digests{$digest}->members;
	}

Most everyone did the same thing, so points go to Anonymous Coward for getting their first.

The results

I’m not assigning a winner to first part that involved finding files. Anonymous Coward wins the second part by setting up his digest to be flexible through Digest‘s interface. Anonymous Coward narrowly pips the other solutions for reporting the duplicates because he was first, since most people used a hash with array references for values.

2 thoughts on “Learning Perl Challenge: finding duplicates (Answer)”

This one should work quite a bit faster than the rest since calculating the md5/sha hash for each file is quite wasteful. We only need to do that for files of the same size.

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Find;
use Fcntl ':mode';
use Digest::SHA;

die "require one parameter\n" if @ARGV != 1;

local (%::sizes, %::name2sha, %::sha2name);

sub process() {
	return if /\.{1,2}$/;
	my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
		$atime,$mtime,$ctime,$blksize,$blocks)
		= lstat($_);
	return unless(S_IFREG == S_IFMT($mode));
	if($::sizes{$size}) {
		unless($::name2sha{$::sizes{$size}}){
			$::name2sha{$::sizes{$size}}=get_sha1($::sizes{$size});
			$::sha2name{$::sizes{$size}}=$::name2sha{$::sizes{$size}};
		}
		my $sha=get_sha1($_);
		$::name2sha{$_}=$sha;
		if($::sha2name{$sha}) {
			print "$::sha2name{$sha} -> $_\n";
		} else {
			$::sha2name{$sha}=$_;
		} 
	} else {
		$::sizes{$size}=$_;
	}
}

sub get_sha1 {
    my $name = shift;
    open FILE, $name or die "open $name error: $!";
    binmode FILE;
    my $sha1 = Digest::SHA->new(1)->addfile(*FILE)->hexdigest;
    close FILE;
    return($sha1);
}

find({wanted => \&process, no_chdir=> 1}, @ARGV);

hmm, I think there is bugfix needed:

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Find;
use Fcntl ':mode';
use Digest::SHA;

die "require one parameter\n" if @ARGV != 1;

local (%::sizes, %::name2sha, %::sha2name);

sub process() {
	return if /\.{1,2}$/;
	my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
		$atime,$mtime,$ctime,$blksize,$blocks)
		= lstat($_);
	return unless(S_IFREG == S_IFMT($mode));
	if($::sizes{$size}) {
		my $sha;
		unless($::name2sha{$::sizes{$size}}){
			$sha=get_sha1($::sizes{$size});
			$::name2sha{$::sizes{$size}}=$sha;
			$::sha2name{$sha}=$::sizes{$size};
		}
		$sha=get_sha1($_);
		$::name2sha{$_}=$sha;
		if($::sha2name{$sha}) {
			print "$::sha2name{$sha} -> $_\n";
		} else {
			$::sha2name{$sha}=$_;
		} 
	} else {
		$::sizes{$size}=$_;
	}
}

sub get_sha1 {
    my $name = shift;
    open FILE, $name or die "open $name error: $!";
    binmode FILE;
    my $sha1 = Digest::SHA->new(1)->addfile(*FILE)->hexdigest;
    close FILE;
    return($sha1);
}

find({wanted => \&process, no_chdir=> 1}, @ARGV);

Comments are closed.