Learning Perl Challenge: directory with the most files

Find the 10 directories on your system that have the most files in them. Count only the files immediately under each directory; that is, don’t count files in subdirectories. By “file”, I mean just about anything that isn’t a symbolic link or weird device. Think about how you want to show the person running your program what it’s doing.

This challenge isn’t about counting so much as traversing, remembering, and displaying. How do you know what you need to handle next and which one was largest?
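One way to picture those three jobs is a queue of directories still to visit, a hash that remembers each count, and a sort at the end. The following is only a minimal sketch under those assumptions, not my actual solution; the helper names `count_files_per_dir` and `top_n` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the plain files directly inside every
# directory under $start, using a queue of directories still to visit.
sub count_files_per_dir {
    my ($start) = @_;
    my %count;
    my @queue = ($start);                  # what we need to handle next

    while (defined(my $dir = shift @queue)) {
        opendir my $dh, $dir or next;      # skip directories we cannot read
        $count{$dir} = 0;
        for my $entry (readdir $dh) {
            next if $entry eq '.' or $entry eq '..';
            my $path = "$dir/$entry";
            next if -l $path;              # ignore symbolic links entirely
            if    (-d $path) { push @queue, $path }   # traverse it later
            elsif (-f $path) { $count{$dir}++ }       # remember the file
        }
        closedir $dh;
    }
    return \%count;
}

# Hypothetical helper: the $n directories with the most files, largest first.
sub top_n {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    splice @sorted, $n if @sorted > $n;    # keep at most $n entries
    return @sorted;
}

my $start = @ARGV ? $ARGV[0] : '.';
my $count = count_files_per_dir($start);
printf "%8d %s\n", $count->{$_}, $_ for top_n($count, 10);
```

Remembering every directory in a hash is simple but can use a lot of memory on a big system; keeping only the current leaders is the obvious alternative.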

I’m actually writing my own version of this right now because I’m making some benchmarks on opendir versus glob and need some test cases. I could just create some new directories and make a bunch of fake files in them, but that’s no fun.
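A comparison like that could be set up with the core Benchmark module. This is a rough sketch rather than the benchmark I am writing: the two counting subs are hypothetical, and it uses `bsd_glob` from File::Glob so that a directory name containing spaces is not split into several patterns.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Glob qw(bsd_glob);

# Count plain files (no symlinks) directly inside $dir with opendir/readdir.
sub count_with_opendir {
    my ($dir) = @_;
    opendir my $dh, $dir or return 0;
    my $n = grep { !-l "$dir/$_" and -f "$dir/$_" } readdir $dh;
    closedir $dh;
    return $n;
}

# The same count with glob. bsd_glob does not split its pattern on
# whitespace, but glob metacharacters in $dir itself would still confuse it.
sub count_with_glob {
    my ($dir) = @_;
    my $n = grep { !-l $_ and -f $_ } bsd_glob("$dir/*"), bsd_glob("$dir/.*");
    return $n;
}

my $dir = @ARGV ? $ARGV[0] : '.';

# Run each version 1000 times and print a comparison table.
cmpthese(1000, {
    opendir => sub { count_with_opendir($dir) },
    glob    => sub { count_with_glob($dir) },
});
```

On a nearly empty directory the timings are too small to mean much, which is why real directories with many files make better test cases.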

I don’t care how long your program takes, although you might. Let it run in a window (or screen) on its own. Test it on a small directory first (so, there’s a hint there).

I made a Curses version (but don’t look at it until you’ve tried your own solution!):

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

7 thoughts on “Learning Perl Challenge: directory with the most files”

use strict;
use warnings;
use Path::Iterator::Rule;
$|++;

# start in this dir
my $BASEDIR = $ARGV[0] || $ENV{HOME};

# how many top results
my $TOP_N = $ARGV[1] || 10;

# give intermediate results to make it less boring for the user
my $INTERMEDIATE_RESULTS_EVERY = $ARGV[2] || 10000;

# maps dirname to number of files in it
my $dir_to_file_number = {};

# shortcut to number of keys in dir_to_file_number
my $dirs_visited = 0;

# helper function that prints the top-n
sub print_top_n {
    my $cur_result = 0;
    # sort descending by file count
    my @keys_sorted = sort { $dir_to_file_number->{$b} <=> $dir_to_file_number->{$a} } keys %$dir_to_file_number;
    for (@keys_sorted) {
        last if $cur_result++ >= $TOP_N;
        printf "\n%-8d: %s", $dir_to_file_number->{$_}, $_;
    }
    print "\n";
}

# rules for traversing dirs
my $dir_rule = Path::Iterator::Rule->new
    ->dir
    ->min_depth(1)
    ;
# rules for finding files in a dir
my $file_rule = Path::Iterator::Rule->new
    ->file
    ->min_depth(1)
    ->max_depth(1)
    ;

# dir iterator
my $next_dir = $dir_rule->iter($BASEDIR, {follow_symlinks=>0, depthfirst=>0, loop_safe=>0} );
while (my $cur_dir = $next_dir->()) {

    # print current dir
    printf "\r%-70s", substr($cur_dir,0, 70);

    # file iterator
    my $next_file = $file_rule->iter( $cur_dir, {follow_symlinks=>0} );
    
    # count the files
    my $i = 0;
    $i++ while (my $cur_file = $next_file->());
    $dir_to_file_number->{$cur_dir} = $i;

    # print intermediate results if wanted
    if (++$dirs_visited % $INTERMEDIATE_RESULTS_EVERY == 0) {
        print_top_n;
        $dirs_visited = 0;
    }
}

# print final results
print_top_n;

I had a script that does just that in my ~/bin, originally using File::Find, now rewritten using Path::Iterator::Rule, after reading rjbs’ file finder modules comparison.

Might as well try doing this challenge (interesting enough).

use strict;
use warnings;
use File::Find::Rule;

sub count_files {
    my $directory = shift;
    # This works (and it's less ugly), but it's 10 seconds slower.
    # File::Find::Rule->new->file->maxdepth(1)->in($directory);
    my $dh;
    opendir $dh, $directory or return;
    grep { -f "$directory/$_" } readdir $dh;
}

my @top;
for (File::Find::Rule->new->directory->in("/")) {
    my $count = count_files($_) || 0;
    # Purely optimization, appears to save 0.5 seconds on my PC
    next if $count <= ($top[9] || [0])->[0];
    push @top, [$count, $_];
    @top = reverse sort {$a->[0]  <=> $b->[0]} @top;
    # Keep only the top 10 elements
    splice @top, 10;
}

for (@top) {
    printf "%6u %s\n", @$_;
}

Runs in 4 seconds on my PC. Probably could be optimized further, but I don’t care much. Also, I hope that using CPAN is fine.
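One place left to trim is the full re-sort for every candidate directory. A minimal sketch of the alternative, assuming a hypothetical helper `remember_top` that keeps the list sorted by inserting each new pair at the right position:

```perl
use strict;
use warnings;

# Hypothetical helper: insert [$count, $dir] into @$top, which is kept
# sorted by count (largest first) and never longer than $limit entries.
sub remember_top {
    my ($top, $limit, $count, $dir) = @_;

    # Cheap rejection: the list is full and this count displaces nothing.
    return if @$top >= $limit and $count <= $top->[-1][0];

    # Find the first entry with a smaller count and insert before it.
    my $pos = 0;
    $pos++ while $pos < @$top and $top->[$pos][0] >= $count;
    splice @$top, $pos, 0, [$count, $dir];

    # Drop whatever fell off the end.
    splice @$top, $limit if @$top > $limit;
    return;
}

# Made-up counts, just to show the list staying sorted and bounded.
my @top;
remember_top(\@top, 3, @$_) for [5, 'a'], [1, 'b'], [9, 'c'], [7, 'd'], [2, 'e'];
printf "%6u %s\n", @$_ for @top;
```

With only ten slots the linear scan is cheap, so whether this beats the sort in practice would need measuring.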

A first attempt…

#! /usr/bin/env perl

# usage: ttdir [directory]

use common::sense;
use File::Find;

my @path=shift || '/home/qje96/learning-perl';
my $top=10;
my %dircounter;
my @counted_dirs;

find(\&count, @path);

sub count {
  $dircounter{$File::Find::dir}++ if (-f && !-l);
}

foreach my $dir (keys %dircounter) {
  push @counted_dirs, [$dircounter{$dir}, $dir];
}

@counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs;

for (my $rank=0; $rank<$top; $rank++) {
  printf "%5d files in %s\n", $counted_dirs[$rank]->[0], $counted_dirs[$rank]->[1];
}

exit 0;

Bugfix: in small trees there may be fewer than 10 directories.

#! /usr/bin/env perl

# usage: ttdir [directory]

use common::sense;
use File::Find;

my @path=shift || '/home/qje96/perl/learning-perl';
my $top=10;
my %dircounter;
my @counted_dirs;
my $dirnumber;

find(\&count, @path);

sub count {
  $dircounter{$File::Find::dir}++ if (-f && !-l);
}

foreach my $dir (keys %dircounter) {
  push @counted_dirs, [$dircounter{$dir}, $dir];
  $dirnumber++;
}

@counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs;

for (my $rank=0; ($rank<$top and $rank<$dirnumber); $rank++) {
  printf "%5d files in %s\n", $counted_dirs[$rank]->[0], $counted_dirs[$rank]->[1];
}

exit 0;

Added some comments and tried to use a somewhat better style.

#!/usr/bin/env perl

# usage: ttdir [directory]

use common::sense;
use File::Find;

my @path=shift || '.';
my $top=10;
my %dircounter;
my @counted_dirs; # array of arrays

find(\&count, @path); # let File::Find do all the work of traversing

foreach my $dir (keys %dircounter) {
  push @counted_dirs, [$dircounter{$dir}, $dir]; # add an array
}

@counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs; # sort descending by number of files

foreach my $dir (@counted_dirs[0..$top-1]) { # slice of top-n elements
  last unless (defined $dir);
  printf "%5d files in %s\n", @$dir;
}

exit 0;

sub count {
  $dircounter{$File::Find::dir}++ if (-f && !-l); # $_ contains filename in actual directory
}

It was fun :). Here is my solution:

#!/usr/bin/env perl

use warnings;
use strict;

my $dir = $ARGV[0] || '.';
my $N = 10; # Number of top directories to show
my %dirs;

opendir(DIR, $dir) or die $!;
while (my $file = readdir(DIR)) {
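    next if $file =~ m/^\./;    # skip ., .. and hidden entries
    my $path = "$dir/$file";
    next if -f $path;           # test the full path, not a name relative to the cwd
    # path => total files (hidden files included, symbolic links excluded)
    $dirs{$path} =
        scalar( grep { -f and ! -l } glob("$path/* $path/.*"));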
}
closedir(DIR);

# List $N top directories sorted
for ( ( sort { $dirs{$b} <=> $dirs{$a} } keys %dirs )[0..$N-1] )
{
    exit 0 if (! $_ );
    printf("%6s %s\n", $dirs{$_}, $_)
}

Tom Legrady says:

March 9, 2013 at 22:24
I’ve posted similar challenges as brainteasers at work, as well as using them to sort job applications into the categories “knows Perl” and “knows the word ‘Perl’”. I’ve found that my beautiful, well-factored, beautifully documented code can be boiled down to a simple Unix pipeline, so in this case I began with that:
```
sudo find ~ -type f -print 2> /dev/null | xargs -n 1 dirname  | sort | uniq -c  | sort -n -r  | head
```
sudo because we’ll have to go into all sorts of directories, not all of which are owned by the user, even on my personal desktop.

find ~ -type f -print – Traverse all directories beginning at the home directory (substitute / to start at the root directory), and print out the path to each file.

2> /dev/null – If there are weird errors because of strange names, just discard the error messages. Might not be a suitable solution if your software is running a nuclear power plant or a Mars Rover, but a great first approximation.

xargs -n 1 dirname – take each line of output from ‘find’, and consider only the path; discard the filename component.

sort – Get all identical values adjacent.

uniq -c – replace a sequence of identical lines with a single instance, preceded by the number of times it was seen. Non-adjacent instances are not collapsed, which is why the sorting is necessary beforehand.

sort -n -r – Sort the output of ‘uniq’ by the numeric count ( -n ), in descending order (-r).

head – brian wants the first ten.

Makes for a pretty good start, but it generates error messages about mismatched quote characters. Using tr to clean out all the expected characters in filenames, to find the odd chars, I discover filenames containing ‘, `, ~, ^, %, #, +, {, }, [, ], and |, plus some files with entirely Chinese names. Rereading the ‘find’ man page, I rediscover “-print0”, in the sense that I’ve read about it before but never used it. The man page says:
```
 -X  Permit find to be safely used in conjunction with xargs(1).  If a file
       name contains any of the delimiting characters used by xargs(1), a diag-
       nostic message is displayed on standard error, and the file is skipped.
       The delimiting characters include single (`` ' '') and double (`` " '')
       quotes, backslash (``\''), space, tab and newline characters.

       However, you may wish to consider the -print0 primary in conjunction with
      ``xargs -0'' as an effective alternative.
```
-print0 uses null-terminated strings and -0 tells xargs to expect that input. Changing to:
```
sudo find ~ -type f -print0 2> /dev/null | xargs -0 -n 1 dirname  | sort | uniq -c  | sort -n -r  | head
```
produces much better results:
```
$ sudo find /Users/ -type f -print0 2> /dev/null | xargs -0 -n 1 dirname | sort | uniq -c | sort -n -r | head

17458 /Users//tomlegrady/Desktop/.../Mailboxes/Clubs/TLUG/2004.mbox/Messages
```
and similar numbers for other directories.

Hmm … time to delete that directory, haven’t looked in there in nine years.
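Back in Perl, the same null-terminated output can be read by setting the input record separator to a NUL byte, with a hash doing the work of sort and uniq -c. This is only a sketch under that assumption: `count_by_dirname` is a hypothetical helper, and the script expects a `find` that supports -print0 on the PATH.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Hypothetical helper: read NUL-terminated paths (as written by
# `find -print0`) from $fh and count how many fall in each directory.
sub count_by_dirname {
    my ($fh) = @_;
    my %count;
    local $/ = "\0";                 # records end at a NUL byte, not a newline
    while (defined(my $path = <$fh>)) {
        chomp $path;                 # removes the trailing NUL
        $count{ dirname($path) }++;  # the hash replaces sort | uniq -c
    }
    return \%count;
}

my $start = @ARGV ? $ARGV[0] : '.';

# List-form open bypasses the shell, so $start needs no quoting.
if (open my $find, '-|', 'find', $start, '-type', 'f', '-print0') {
    my $count = count_by_dirname($find);
    close $find;
    my @dirs = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    splice @dirs, 10 if @dirs > 10;  # same job as head
    printf "%7d %s\n", $count->{$_}, $_ for @dirs;
}
else {
    warn "could not run find: $!\n";
}
```

Because the paths never pass through a shell or xargs, quotes and spaces in filenames need no special handling.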
