Learning Perl Challenge: Find duplicates

This is the second novice challenge. I’ll give you a problem that you should be able to solve with just the material we present in Learning Perl (including using modules, so, most of Perl). A week or so later, I’ll post a solution.

For this one, given a single directory that may contain duplicated files, find the files that might be duplicates and print their names. If you want to remove the duplicated files, make sure that you have a backup first!

There are some modules that might be helpful.

If you are especially motivated, also search through any subdirectories that you find.
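One possible shape of a solution, as a rough sketch (this is not the official solution; the flat readdir loop and the choice of Digest::MD5 are my own):

```perl
#!/usr/bin/env perl
# Sketch: group the files in one directory by MD5 digest and
# print every group that contains more than one file.
use strict;
use warnings;
use feature 'say';
use Digest::MD5;

my $dir = shift // '.';
opendir my $dh, $dir or die "Cannot open $dir: $!";

my %by_digest;
while ( my $entry = readdir $dh ) {
    my $path = "$dir/$entry";
    next unless -f $path;               # skip subdirectories and specials
    open my $fh, '<', $path or next;    # unreadable files are ignored
    binmode $fh;
    push @{ $by_digest{ Digest::MD5->new->addfile($fh)->hexdigest } }, $path;
    close $fh;
}
closedir $dh;

for my $paths ( values %by_digest ) {
    say "Possible duplicates: @$paths" if @$paths > 1;
}
```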

You can see a list of all the Challenges and my summaries, as well as the programs that I created, in the Learning Perl Challenges GitHub repository.


18 Comments.

  1. Anonymous Coward

    I don’t particularly like my solution and I’m looking forward to a more “perlish” one…

    I just digest potential duplicates, since digests are kind of expensive…

    #!/usr/bin/env perl
    
    use Digest;
    use feature 'say';
    use File::Find;
    use strict;
    use utf8;
    use warnings;
    
    die 'Usage: duplicates dir' if @ARGV != 1;
    die "$ARGV[0] is not a directory" if not -d $ARGV[0];
    
    my @directories = $ARGV[0];
    my %files = ();
    my %candidates = ();
    my %duplicates = ();
    
    find \&wanted, @directories;
    while (my ($name, $size) = each %files) {
      if (exists $candidates{$size}) {
        push @{$candidates{$size}}, $name;
      } else {
        $candidates{$size} = [$name];
      }
    }
    %files = ();
    while (my ($size, $names) = each %candidates) {
      delete $candidates{$size} if @{$names} == 1
    }
    foreach my $names (values %candidates) {
      foreach my $name (@{$names}) {
        my $sha256 = Digest->new("SHA-256");
        if (open my $handle, '<', $name) {
          $sha256->addfile($handle);
          my $digest = $sha256->hexdigest;
          close $handle;
          if (exists $duplicates{$digest}) {
            push @{$duplicates{$digest}}, $name;
          } else {
            $duplicates{$digest} = [$name];
          }
        } else {
          warn "Could not open $name: $!";
        }
      }
    }
    %candidates = ();
    while (my ($size, $names) = each %duplicates) {
      say "@{$names} are duplicates" if @{$names} > 1;
      delete $duplicates{$size} if @{$names} == 1;
    }
    
    
    sub wanted {
      $files{$File::Find::name} = -s _ if -f;
    }
    
    • Anonymous Coward

      I’d like to take some time now and comment on my first solution as I didn’t manage to when I first posted it.

      Exactly one directory is allowed as a parameter since I use File::Find to do a recursive search.

      Only plain files are of interest, and since equal sizes are a necessity for two files being duplicates, I build a hash with the names as the keys and the sizes as the values.

      The hash of files is then used to generate a hash of arrays, now with the sizes as keys; the value for each size is an array of the names of all files with that size.

      Hashes are deleted when not needed anymore to save memory.

      Obviously, only sizes for which more than one file exists matter. A digest is calculated and used as a key of a new hash. At the end this new hash will have arrays as values, and only those with more than one element are of interest, since those are duplicates.

      Since MD5 and SHA1 have their issues (have a look at the sixth comment at https://freedom-to-tinker.com/blog/felten/report-crypto-2004 ;)), SHA-256 is used as the digest algorithm.

      A digest in hexadecimal form is not really needed and memory could be saved by using a binary digest which could be displayed using unpack when necessary.
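To illustrate that point (my own sketch, not part of the posted solution): the binary digest is half the size of its hexadecimal form, and unpack converts it for display only when needed:

```perl
use strict;
use warnings;
use Digest::SHA;

# Keep the compact 32-byte binary digest; unpack to hex only for display.
my $sha = Digest::SHA->new('SHA-256');
$sha->add('example data');
my $binary = $sha->digest;            # 32 raw bytes
my $hex    = unpack 'H*', $binary;    # 64 hexadecimal characters
print "$hex\n";
```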

      Anyway, I’ve rewritten my solution, although I am still not very satisfied:

      use Digest;
      use feature 'say';
      use File::Find;
      use strict;
      use utf8;
      use warnings;
      
      die 'Usage: duplicates dir' if @ARGV != 1;
      die "$ARGV[0] is not a directory" if not -d $ARGV[0];
      my %files;
      my $algorithm = 'SHA-256';
      find \&wanted, shift;
      while (my ($size, $hash) = each %files) {
        while (my ($digest, $array) = each %{$hash}) {
          if ($digest ne 'notdigest' and $digest ne 'notcalculated' and scalar @{$array} > 1) {
            say "The files of size $size bytes and digest $digest are duplicates:";
            say join "\n", @{$array};
          }
        }
      }
      
      sub wanted {
        if (-f) {
          my $size = -s _;
          if (exists $files{$size}) {
            if (-r _) {
              my $digest = Digest->new($algorithm);
              $digest->addfile($File::Find::name);
              push @{$files{$size}{$digest->hexdigest}}, ($File::Find::name);
              if (exists $files{$size}{notdigest}) {
                if (-r $files{$size}{notdigest}[0]) {
                  $digest->addfile($files{$size}{notdigest}[0]);
                  push @{$files{$size}{$digest->hexdigest}}, delete $files{$size}{notdigest}[0];
                } else {
                  push @{$files{$size}{notcalculated}}, delete $files{$size}{notdigest}[0];
                }
                delete $files{$size}{notdigest}
              }
            } else {
              push @{$files{$size}{notcalculated}}, ($File::Find::name);
            }
          } else {
            $files{$size} = {notdigest => [$File::Find::name,],};
          }
        }
      }
      

      This second version takes the ideas of the first solution and combines them directly in the wanted sub. I hope it’s more straightforward.

  2. I actually had a bit of fun throwing this one together. It not only prints out the duplicates, but it prints them out in a way that you can see how the duplicates are partitioned (i.e., if you have files A, B, C, D, and E, with A, B, and C duplicates and D and E duplicates, but A not a duplicate of D, then it groups the two clusters separately).

    There are up to three command line options: -r for a recursive subdirectory search, -l to do a line-by-line comparison of files (if not enabled, then files are compared via a SHA1 hash), and a directory name (default of .).

    use strict;
    use warnings;
    use File::Compare;
    use File::Spec;
    use Digest::SHA1;
    use Getopt::Long qw(:config pass_through);
    use Set::Scalar;
    
    my $search_subdir=0; #flag to determine whether or not subdirectories should be searched.
    
    #flag to determine whether or not we do a line-by-line comparison.
    #If not enabled (default), then the SHA1 hashes of each file will be used for comparison.
    my $line_by_line=0; 
    
    GetOptions('recursive|r'=>\$search_subdir,'line_by_line|l'=>\$line_by_line);
    
    my $dir=".";
    
    warn "WARNING: All arguments except " . $ARGV[0] . " will be ignored.\n" if @ARGV>1;
    
    $dir=$ARGV[0] if @ARGV;
    
    die "Argument $dir is not a directory" unless (-d $dir);
    
    #For reasons I don't understand, File::Find doesn't seem to like relative directories...
    my $abs_dir=File::Spec->rel2abs($dir);
    
    my @files=();
    
    if($search_subdir)
    {
    	#Do a depth-first grab of files in $dir and all subdirectories.
    	use File::Find;
    	find(\&grab_files,$abs_dir);
    }
    else #Only grab files from $dir.
    {
    	opendir(my $dh,$abs_dir) or die $!;
    
    	@files=grep{-f $_}map{File::Spec->catfile($abs_dir,$_)}readdir($dh);
    
    	closedir($dh);
    }
    
    unless(@files)
    {
    	print "No files found in directory $dir\n";
    	exit 0;
    }
    
    #Array of Set::Scalar objects, 
    #each of which represent files that are (pairwise) duplicate.
    #So, this forms a partition of the subset of @files that has a duplicate.
    
    my @duplicates=();
    
    #We now compare all pairs of files in @files.
    #The comparison function (given below) depends on whether or not -l is enabled.
    
    foreach my $file1(@files)
    {
    	foreach my $file2(@files)
    	{
    		next if $file1 eq $file2; #only comparing distinct pairs of files!
    		
    		if(compare_files($file1,$file2)) #If they're the same...
    		{
    			#first, see if $file1 is in any element of @duplicates.
    			my $found=0; #flag to see if we found $file1 or $file2
    			
    			foreach my $set (@duplicates)
    			{
    				if($set->has($file1))
    				{
    					$set->insert($file2);
    					$found=1;
    					last;
    				}
    				elsif($set->has($file2))
    				{
    					$set->insert($file1);
    					$found=1;
    					last;
    				}
    			}
    			
    			unless($found) #If we didn't find $file1 or $file2 in @duplicates, add a new set!
    			{
    				push @duplicates,Set::Scalar->new($file1,$file2);
    			}
    		}
    	}
    }
    
    #Now we print out the results.
    
    unless(@duplicates)
    {
    	print "No duplicate files found!\n";
    	exit 0;
    }
    
    my $hl="\n\n" . ('~' x 20) . "\n\n"; #Horizontal "line" to keep duplicate sets nice and separated.
    
    print "Duplicates:\n";
    
    foreach my $set (@duplicates)
    {
    	print $hl;
    	my @elements=$set->elements;
    	foreach(sort @elements)
    	{
    		print "$_\n";
    	}
    	print $hl;
    }
    
    
    sub compare_files
    {
    	my ($file1,$file2)=@_;
    	
    	if($line_by_line) #using File::Compare::compare
    	{
    		my $ret_val=eval{compare($file1,$file2)};
    		
    		die "File::Compare::compare encountered an error: " . $@ if $@;
    		
    		return 1 if $ret_val==0; #compare() returns 0 if the files are the same...
    		
    		return undef;
    	}
    	else #Otherwise, we use Digest::SHA1.
    	{
    		open(my $fh1,"<",$file1) or die $!;
    		open(my $fh2,"<",$file2) or die $!;
    		
    		my $sha1=Digest::SHA1->new;
    		
    		$sha1->addfile($fh1); #Reads file.
    		my $hex1=$sha1->hexdigest; #40 byte hex string.
    		
    		$sha1->reset;
    		$sha1->addfile($fh2);
    		my $hex2=$sha1->hexdigest;
    		
    		close($fh1);
    		close($fh2);
    		
    		return $hex1 eq $hex2;
    	}
    }
    
    sub grab_files
    {
    	my $file=$File::Find::name; #/relative/path/to/file/filename
    	if((-r $file) and (-f $file))
    	{
    		push @files,$file;
    	}
    	elsif((-f $file) and !(-r $file))
    	{
    		#Customizing the warning message in case we don't have a LOGNAME value in %ENV (eg in Windows)
    		my $warning_msg="WARNING: File $file is not readable";
    		$warning_msg.=" by user " . $ENV{LOGNAME} if exists $ENV{LOGNAME};
    		$warning_msg .="\n";
    		warn $warning_msg;
    	}
    }
    
  3. #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp;
    use Data::Dumper;
    use feature qw( say );
    use Digest::MD5 qw( md5_hex );
    
    my $folder = shift;
    die "\n\tUsage:\t$0 name-of-the-folder-to-search-in\n\n" unless $folder;
    
    my $duplicate_files = {};
    
    sub _process_folder{
        my $folder_name = shift;
    
        foreach my $file_name ( glob  "$folder_name/*" ){
            #nothing to do if current, or parent directory
            next if $file_name =~ m{(?:\A|/)\.\.?\z};
    
            # $file_name might actually be a folder
            if ( -d $file_name ){
                _process_folder($file_name);
                next ;
            };
    
            my $file_content = read_file( $file_name, binmode => ':raw' );
            my $checksum = md5_hex( $file_content );
            push @{ $duplicate_files->{$checksum} }, $file_name;
        }
    
        return;
    }
    
    _process_folder( $folder );
    
    
    foreach my $checksum ( keys %$duplicate_files ){
        say 'The files: '.Dumper( $duplicate_files->{$checksum} ).' have checksum '.$checksum if scalar( @{ $duplicate_files->{$checksum} } ) > 1;
    }
    
  4. #!/usr/bin/env perl
    
    use common::sense;
    use Digest::MD5 qw(md5_hex);
    use File::Find; #not needed yet, later for recursion
    use Getopt::Std; #not needed yet, later for other path than working directory
    
    # TODO: option -r (recursive); parameter DIR (if different from actual directory).
    
    my @files=glob'*';
    my %filehashes;
    my $dupfree=1;
    
    foreach my $file(@files) {
      if (-d $file or -l $file) {
        next;
      }
      open FILE, '<', $file or next;
      binmode FILE;
      my $digest=md5_hex(<FILE>);
      if (exists $filehashes{$digest}) {
        $dupfree=0;
        push @{$filehashes{$digest}}, $file;
        print "duplicates detected: ";
        foreach $file (@{$filehashes{$digest}}) {
          print "$file  ";
        }
      print "\n";
      next;
      }
      $filehashes{$digest}=[$file];
    }
    
    if ($dupfree==1) {
      say "no duplicates detected.";
    }
    
    exit 0;
    
    • after compressing:

      #!/usr/bin/env perl
      
      use strict;
      use warnings;
      
      use Digest::MD5 qw(md5_hex);
      
      my %nrOfDuplicates;
      
      while (<>){
      	chomp;
      	next if -d || -l;
      	open my $fle, '<', $_ or next;
      	binmode $fle;
      	print "$_\n" if $nrOfDuplicates{md5_hex(<$fle>)}++;
      	close $fle;
      }
      
    • #!/usr/bin/env perl
      
      use common::sense;
      use Digest::MD5 qw(md5_hex);
      use File::Find;
      use Getopt::Std;
      
      my $opts = {};
      getopts('r', $opts);
      
      my $dir = shift // '.';
      die "Cannot access $dir\n" unless (-x $dir);
      
      my @files;
      my %filehashes;
      my $dups;
      
      if ($opts->{r}) {
        find(
          sub {
            push @files, $File::Find::name if (-f && !-l);
          }, $dir);
      }
      else {
        @files = glob "$dir/*";
      }
      
      foreach my $file (@files) {
        if (-d $file or -l $file) {
          next;
        }
        open FILE, '<', $file or die "Cannot open $file\n";
        binmode FILE;
        my $digest = md5_hex(<FILE>);
        close FILE or die "Cannot close $file\n";
        push @{$filehashes{$digest}}, $file;
      }
      
      foreach my $digest (keys %filehashes) {
        my $array_ref = $filehashes{$digest};
        if (scalar(@$array_ref) > 1) {
          $dups++;
          say "these files are duplicates:";
          foreach my $file (@$array_ref) {
            say $file;
          }
          say;
        }
      }
      
      say "no duplicates detected." unless $dups;
      
  5. Gustavo Chaves

    The main idea behind the following code is to first classify the files by size so that we can skip every file with a singular size. This way we avoid calculating the SHA1 hash for any file that has no other file of the same size, hopefully skipping most of them. Then we calculate SHA1 only for the files within each same-sized class.

    I used File::Slurp to make it concise. It would be better to use Digest::SHA1’s OO interface so that we could read the files in chunks and avoid the danger of slurping huge files into memory.

    #!/usr/bin/env perl
    
    use 5.014;
    use utf8;
    use autodie;
    use warnings;
    use File::Find;
    use File::Slurp;
    use Digest::SHA1 qw(sha1_hex);
    
    # Classify files per size
    my %size2files;
    find(
        sub { push @{$size2files{-s _}}, $File::Find::name if -f; },
        @ARGV,
    );
    
    foreach my $same_sizes (grep {scalar(@$_) > 1} values %size2files) {
        # Classify files per SHA1, for each group of more than one same
        # sized files
        my %sha2files;
        foreach my $file (@$same_sizes) {
            push @{$sha2files{sha1_hex(read_file($file))}}, $file;
        }

        # Print duplicates found
        foreach my $same_shas (grep {scalar(@$_) > 1} values %sha2files) {
            say join(' ', 'Duplicates: ', @$same_shas);
        }
    }
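The chunk-reading alternative mentioned above might look like this (a sketch; it uses the core Digest::SHA module rather than the older Digest::SHA1, since its addfile streams a filehandle in fixed-size chunks, so the whole file is never held in memory):

```perl
use strict;
use warnings;
use Digest::SHA;

# Hash a file without slurping it: addfile() reads the handle in chunks.
sub sha_of_file {
    my ($path) = @_;
    my $sha = Digest::SHA->new('SHA-1');
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    $sha->addfile($fh);    # streams the file; memory use stays constant
    close $fh;
    return $sha->hexdigest;
}
```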
    
    • Gustavo Chaves

      Sorry, but I can’t resist making it a little bit shorter and clearer:

      #!/usr/bin/env perl
      
      use 5.014;
      use utf8;
      use autodie;
      use warnings;
      use File::Find;
      use File::Slurp;
      use Digest::SHA1 qw(sha1_hex);
      
      # Classify files per size
      my %size2files;
      find(
          sub { push @{$size2files{-s _}}, $File::Find::name if -f },
          @ARGV,
      );
      
      # For each group of two or more same sized files ...
      foreach my $same_sizes (grep {@$_ > 1} values %size2files) {
          # Classify files per SHA1
          my %sha2files;
          foreach my $file (@$same_sizes) {
              push @{$sha2files{sha1_hex(read_file($file))}}, $file;
          }
      
          # Print duplicates found
          foreach my $same_shas (grep {@$_ > 1} values %sha2files) {
              say join(' ', 'Duplicates:', @$same_shas);
          }
      }
      
  6. I took the simplicity approach. I skipped using Getopt::Long and just assumed the first two arguments were the directories. One less part of the program to worry about.

    I decided to use File::Find although I prefer File::Find::Object. File::Find is just freaky: you use the wanted subroutine to find your files, but you have to use a global (defined outside the subroutine) list to store what you find if you want to use it outside the subroutine. Plus, File::Find uses package variables. However, since all of the other modules used here are part of the standard Perl distribution, I decided to use File::Find since it’s also a standard module. I wish File::Find::Object became a standard module.

    I decided to use Digest::SHA over Digest::MD5 because Digest::SHA’s addfile method allows me to add the file without opening it first. It probably makes no difference in the efficiency of the code (the file has to be opened somewhere, whether in Digest::SHA->addfile or in my code), but it makes my code cleaner.

    One of the things you can do with File::Find::find’s wanted function is to embed it in the call. For small wanted subroutines, the readability isn’t harmed, and you don’t have to search for the wanted code.

    By the way, I make the assumption that file bar/foo in directory #1 is the same file as bar/foo in directory #2, but that bar/bar/foo in directory #1 is not the same file as bar/foo in directory #2 since it’s in a different subdirectory.

    #! /usr/bin/env perl
    #
    use strict;
    use warnings;
    use feature qw(say);
    use File::Find;
    use Digest::SHA;
    
    use constant ALG => 256;
    
    #
    # Get Directories
    #
    
    my $dir_1 = shift;
    my $dir_2 = shift;
    
    if (not $dir_2) {
        die qq(Usage = compare dir1 dir2\n);
    }
    
    my @file_list;
    find ( sub {
    	return if not -f;
    	(my $file_name = $File::Find::name) =~ s/^\Q$dir_1\E\///;
    	push @file_list => $file_name;
        }, $dir_1);
    
    foreach my $file (@file_list) {
        next if not -e "$dir_2/$file";
    
        my $sha1 = Digest::SHA->new(ALG);
        my $sha2 = Digest::SHA->new(ALG);
    
        $sha1->addfile("$dir_1/$file");
        $sha2->addfile("$dir_2/$file");
    
        if ($sha1->hexdigest eq $sha2->hexdigest) {
    	say qq(Files "$dir_1/$file" and "$dir_2/$file" are duplicates);
        }
    }
    
  7. TIMTOWTDI:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use File::Slurp;
    
    my %data;
    
    while (<*>) {
        push @{$data{md5_hex(read_file($_))}},$_ if -f $_;
    }
    for my $md5 (keys %data) {
        print "$md5: ",join(", ",@{$data{$md5}}),"\n" if scalar @{$data{$md5}} > 1;
    }
    
  8. #!/usr/bin/env perl
    
    use Modern::Perl '2012';
    use Digest::MD5;
    use File::Find;
    
    sub get_checksum($)
    {
    open my $fh, "<", $_[0] or return undef;
    binmode $fh;
    local $/;
    return Digest::MD5::md5_hex(<$fh>);
    }
    
    push @ARGV, '.' unless scalar @ARGV;
    
    my %checksum;
    
    find({
            wanted => sub {
                return if -d $_ || ! -r $_;
                return unless defined (my $key = get_checksum($_));
                if (exists $checksum{$key})
                {
                    # We have already seen another file with this checksum
                    say $checksum{$key}, " ", $_;
                }
                else
                {
                    $checksum{$key} = $_;
                }
            },
            no_chdir => 1,
        },
        @ARGV);
    
  9. #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use File::Find;
    use File::Basename;
    use Digest::MD5;
    
    ($#ARGV == 0) or die "Usage: " . basename($0) . " <directory>\n";
    
    my %hashes = ();
    
    find \&wanted, @ARGV;
    
    sub get_hash($)
    {
        open(my $fh, '<', $_[0]) or die "Cannot open $_[0]: $!";
        binmode $fh;
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }
    
    sub wanted
    {
        $hashes{$File::Find::name} = get_hash($_) if -f $_;
    }
    
    for my $path (sort { $hashes{$a} cmp $hashes{$b} } keys %hashes)
    {
        print "$path\n"
            if ((grep { $_ eq $hashes{$path} } values %hashes) > 1);
    }
    
    
  11. Actually, I realized that I could cut my processing time in half by not comparing a pair of files twice. The modified code is up at http://pastebin.com/DzDWYdG6
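The halved version itself is only on pastebin, but the idea can be sketched (my reconstruction, not the pastebin code): start the inner loop one position past the outer index, so each unordered pair is compared exactly once:

```perl
use strict;
use warnings;

# Visit each unordered pair once: n*(n-1)/2 comparisons instead of n*(n-1).
my @files = ('A', 'B', 'C', 'D');
my $pairs = 0;
for my $i (0 .. $#files) {
    for my $j ($i + 1 .. $#files) {
        $pairs++;
        print "compare: $files[$i] <=> $files[$j]\n";
    }
}
print "$pairs pairs\n";    # 4 files -> 6 pairs
```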

  12. #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use Digest::MD5;
    use File::Spec;
    
    die "require one parameter\n" if @ARGV != 1;
    opendir DIR, $ARGV[0]
        or die "opendir error: $!";
    my @files = readdir DIR;
    closedir DIR;
    my %fp;
    
    foreach (@files) {
        next if /^\./;
        $_ = File::Spec->catfile($ARGV[0], $_);
        next if -d ;
        open FILE, '<', $_ or die "open error: $!";
        binmode FILE;
        $fp{$_} = Digest::MD5->new->addfile(*FILE)->hexdigest;
    }
    
    foreach (@files) {
        next unless exists $fp{$_};
        my $curr = $fp{$_};
        my @dup = ($_);
        delete $fp{$_};
        foreach my $key (keys %fp) {
            if ($fp{$key} eq $curr) {
                push @dup, $key;
                delete $fp{$key};
            }
        }
        if (@dup > 1) {
            print "file (@dup) are the same...their content is:\n";
        open FILE, '<', $dup[0] or die "open error: $!";
        print while (<FILE>);
        }
    }
    
  13. I know the challenge is over, but I wanted to add one that uses my File::chdir::WalkDir which is like File::Find, but has an (IMO) unique interface.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    use File::chdir::WalkDir;
    use Digest::MD5 qw/md5_hex/;
    
    my %data;
    
    my $callback = sub {
      my ($file, $dir) = @_;
      my $md5 = do {
        open my $fh, '<', $file or die "$!";
        binmode $fh;
        local $/;
        md5_hex(<$fh>);
      };
      push @{ $data{$md5} }, [$dir, $file];
    };
    
    walkdir(shift || '.', $callback);
    
    foreach my $files (values %data) {
      next unless @$files > 1;
      print "Possible matches:\n";
      {
        local $" = ": ";
        print "@$_ \n" for @$files;
      }
      print "\n";
    }
    
