Learning Perl Challenge: directory with the most files

Find the top 10 directories on your system that have the most files in them. Only count the files immediately under each directory; that is, don’t count files in subdirectories. By “file”, I mean just about anything that isn’t a symbolic link or weird device. Think about how you want to show the person running your program what it’s doing.

This challenge isn’t about counting so much as traversing, remembering, and displaying. How do you know what you need to handle next and which one was largest?
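One way to picture the "what to handle next" part is a queue of directories still to be read, plus a hash that remembers the count for each one. The following is only a minimal sketch under that assumption, not the reference solution; the names count_files_per_dir and top_n are made up for illustration, and it prints no progress while it runs.

```perl
use strict;
use warnings;

# Count plain files directly inside each directory at or below $start.
# Returns a hash reference mapping directory path to file count.
sub count_files_per_dir {
    my ($start) = @_;
    my %count;
    my @queue = ($start);              # directories still waiting to be read

    while (defined(my $dir = shift @queue)) {
        opendir my $dh, $dir or next;  # skip directories we cannot read
        $count{$dir} = 0;
        for my $entry (readdir $dh) {
            next if $entry eq '.' or $entry eq '..';
            my $path = "$dir/$entry";
            next if -l $path;          # never count or follow symbolic links
            if    (-d _) { push @queue, $path }  # remember it for later
            elsif (-f _) { $count{$dir}++ }      # plain file in this directory
        }
        closedir $dh;
    }
    return \%count;
}

# Return the (at most) $n directory names with the highest counts.
sub top_n {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    $#sorted = $n - 1 if @sorted > $n;
    return @sorted;
}

# Only traverse when a starting directory is given on the command line.
if (defined(my $start = shift @ARGV)) {
    my $count = count_files_per_dir($start);
    printf "%8d %s\n", $count->{$_}, $_ for top_n($count, 10);
}
```

Run it as `perl sketch.pl some/small/dir` first; sorting everything at the end is the simplest choice, though keeping only a running top 10 would use less memory on a big tree.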

I’m actually writing my own version of this right now because I’m making some benchmarks on opendir versus glob and need some test cases. I could just create some new directories and make a bunch of fake files in them, but that’s no fun.
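A minimal sketch of what an opendir versus glob comparison could look like, using the core Benchmark module. The sub names count_with_opendir and count_with_glob are hypothetical, and this is only an illustration of the idea, not the actual benchmark mentioned above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Glob qw(bsd_glob);

# Count plain files (not symlinks) directly inside $dir using opendir/readdir.
sub count_with_opendir {
    my ($dir) = @_;
    opendir my $dh, $dir or return 0;
    my $count = grep { my $p = "$dir/$_"; -f $p && !-l $p } readdir $dh;
    closedir $dh;
    return $count;
}

# The same count using a glob. bsd_glob does not split the pattern on
# whitespace, and the braces add dotfiles, which a bare * would skip.
# Caveat: glob metacharacters in $dir itself are not escaped here.
sub count_with_glob {
    my ($dir) = @_;
    my $count = grep { -f $_ && !-l $_ } bsd_glob("$dir/{*,.*}");
    return $count;
}

# Run the comparison only when a directory is given on the command line.
if (defined(my $dir = shift @ARGV)) {
    cmpthese(-2, {    # negative count: run each for at least 2 CPU seconds
        opendir => sub { count_with_opendir($dir) },
        glob    => sub { count_with_glob($dir) },
    });
}
```

Both subs should agree on the count for the same directory, which makes a directory with many files a handy test case for the comparison.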

I don’t care how long your program takes, although you might. Let it run in a window (or screen) on its own. Test it on a small directory first (so, there’s a hint there).

I made a Curses version (but don’t look at it until you’ve tried your own solution!):

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.


7 Comments.

  1. use strict;
    use warnings;
    use Path::Iterator::Rule;
    $|++;
    
    # start in this dir
    my $BASEDIR = $ARGV[0] || $ENV{HOME};
    
    # how many top results
    my $TOP_N = $ARGV[1] || 10;
    
    # give intermediate results to make it less boring for the user
    my $INTERMEDIATE_RESULTS_EVERY = $ARGV[2] || 10000;
    
    # maps dirname to number of files in it
    my $dir_to_file_number = {};
    
    # shortcut to number of keys in dir_to_file_number
    my $dirs_visited = 0;
    
    # helper function that prints the top-n
    sub print_top_n {
        my $cur_result = 0;
        my @keys_sorted = sort { $dir_to_file_number->{$b} <=> $dir_to_file_number->{$a} } keys %$dir_to_file_number;
        for (@keys_sorted) { 
            last if $cur_result++ >= $TOP_N;
            printf "\n%-8d: %s", $dir_to_file_number->{$_}, $_;
        }
        print "\n";
    }
    
    # rules for traversing dirs
    my $dir_rule = Path::Iterator::Rule->new
        ->dir
        ->min_depth(1)
        ;
    # rules for finding files in a dir
    my $file_rule = Path::Iterator::Rule->new
        ->file
        ->min_depth(1)
        ->max_depth(1)
        ;
    
    # dir iterator
    my $next_dir = $dir_rule->iter($BASEDIR, {follow_symlinks=>0, depthfirst=>0, loop_safe=>0} );
    while (my $cur_dir = $next_dir->()) {
    
        # print current dir
        printf "\r%-70s", substr($cur_dir,0, 70);
    
        # file iterator
        my $next_file = $file_rule->iter( $cur_dir, {follow_symlinks=>0} );
        
        # count the files
        my $i = 0;
        $i++ while (my $cur_file = $next_file->());
        $dir_to_file_number->{$cur_dir} = $i;
    
        # print intermediate results if wanted
        if (++$dirs_visited % $INTERMEDIATE_RESULTS_EVERY == 0) {
            print_top_n;
            $dirs_visited = 0;
        }
    }
    
    # print final results
    print_top_n;
    
    

    I had a script that does just that in my ~/bin, originally using File::Find, now rewritten using Path::Iterator::Rule, after reading rjbs’ file finder modules comparison.

  2. Might as well try doing this challenge (interesting enough).

    use strict;
    use warnings;
    use File::Find::Rule;
    
    sub count_files {
        my $directory = shift;
        # This works (and it's less ugly), but it's 10 seconds slower.
        # File::Find::Rule->new->file->maxdepth(1)->in($directory);
        my $dh;
        opendir $dh, $directory or return;
        grep { -f "$directory/$_" } readdir $dh;
    }
    
    my @top;
    for (File::Find::Rule->new->directory->in("/")) {
        my $count = count_files($_) || 0;
        # Purely optimization, appears to save 0.5 seconds on my PC
        next if $count <= ($top[9] || [0])->[0];
        push @top, [$count, $_];
        @top = reverse sort {$a->[0]  <=> $b->[0]} @top;
        # Only top 10 elements are fine
        splice @top, 10;
    }
    
    for (@top) {
        printf "%6u %s\n", @$_;
    }

    Runs in 4 seconds on my PC. Probably could be optimized further, but I don’t care much. Also, I hope that using CPAN is fine.

  3. A first attempt…

    #! /usr/bin/env perl
    
    # usage: ttdir [directory]
    
    use common::sense;
    use File::Find;
    
    my @path=shift || '/home/qje96/learning-perl';
    my $top=10;
    my %dircounter;
    my @counted_dirs;
    
    find(\&count, @path);
    
    sub count {
      $dircounter{$File::Find::dir}++ if (-f && !-l);
    }
    
    foreach my $dir (keys %dircounter) {
      push @counted_dirs, [$dircounter{$dir}, $dir];
    }
    
    @counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs;
    
    for (my $rank=0; $rank<$top; $rank++) {
      printf "%5d files in %s\n", $counted_dirs[$rank]->[0], $counted_dirs[$rank]->[1];
    }
    
    exit 0;
    

  4. Bugfix: in small trees there may be fewer than 10 directories.

    #! /usr/bin/env perl
    
    # usage: ttdir [directory]
    
    use common::sense;
    use File::Find;
    
    my @path=shift || '/home/qje96/perl/learning-perl';
    my $top=10;
    my %dircounter;
    my @counted_dirs;
    my $dirnumber;
    
    find(\&count, @path);
    
    sub count {
      $dircounter{$File::Find::dir}++ if (-f && !-l);
    }
    
    foreach my $dir (keys %dircounter) {
      push @counted_dirs, [$dircounter{$dir}, $dir];
      $dirnumber++;
    }
    
    @counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs;
    
    for (my $rank=0; ($rank<$top and $rank<$dirnumber); $rank++) {
      printf "%5d files in %s\n", $counted_dirs[$rank]->[0], $counted_dirs[$rank]->[1];
    }
    
    exit 0;

  5. Added some comments, tried to use a bit better style.

    #!/usr/bin/env perl
    
    # usage: ttdir [directory]
    
    use common::sense;
    use File::Find;
    
    my @path=shift || '.';
    my $top=10;
    my %dircounter;
    my @counted_dirs; # array of arrays
    
    find(\&count, @path); # let File::Find do all the work of traversing
    
    foreach my $dir (keys %dircounter) {
      push @counted_dirs, [$dircounter{$dir}, $dir]; # add an array
    }
    
    @counted_dirs=sort {$b->[0] <=> $a->[0]} @counted_dirs; # sort descending by number of files
    
    foreach my $dir (@counted_dirs[0..$top-1]) { # slice of top-n elements
      last unless (defined $dir);
      printf "%5d files in %s\n", @$dir;
    }
    
    exit 0;
    
    sub count {
      $dircounter{$File::Find::dir}++ if (-f && !-l); # $_ contains filename in actual directory
    }

  6. It was fun :) . Here is my solution:

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    
    my $dir = $ARGV[0] || '.';
    my $N = 10; # Number of top directories to show
    my %dirs;
    
    opendir(DIR, $dir) or die $!;
    while (my $file = readdir(DIR)) {
        next if (-f "$dir/$file" or $file =~ m/^\./); # test the full path, not a name relative to the cwd
        my $path = "$dir/$file";
        # path => total files (yes hidden and no symbolic link)
        $dirs{$path} = 
            scalar( grep { -f and ! -l } glob("$path/* $path/.*"));
    }
    closedir(DIR);
    
    # List $N top directories sorted
    for ( ( sort { $dirs{$b} <=> $dirs{$a} } keys %dirs )[0..$N-1] )
    {
        exit 0 if (! $_ );
        printf("%6s %s\n", $dirs{$_}, $_)
    }
    

  7. I’ve posted similar challenges as brainteasers, at work, as well as using them to sort job applications into the categories: “knows Perl” and “knows the word, ‘Perl’”. I’ve found that my beautiful, well factored, beautifully documented code can be boiled down to a simple Unix pipeline, so in this case I began with that:

    sudo find ~ -type f -print 2> /dev/null | xargs -n 1 dirname  | sort | uniq -c  | sort -n -r  | head
    

    sudo because we’ll have to go into all sorts of directories, not all of which are owned by the user, even on my personal desktop.

    find ~ -type f -print – Traverse all directories beginning at the starting directory (the home directory here; use / to cover the whole system), and print out the path to each file.

    2> /dev/null – If there are weird errors because of strange names, just discard the error messages. Might not be a suitable solution if your software is running a nuclear power plant or a Mars Rover, but a great first approximation.

    xargs -n 1 dirname – take each line of output from ‘find’, and consider only the path; discard the filename component.

    sort – Get all identical values adjacent.

    uniq -c – replace a sequence of identical lines with a single instance, preceded by the number of times it was seen. Non-adjacent instances are not collapsed, which is why the sorting is necessary beforehand.

    sort -n -r – Sort the output of ‘uniq’ by the numeric count ( -n ), in descending order (-r).

    head – brian wants the first ten.

    Makes for a pretty good start, but it generates error messages about mismatched quote characters. Using tr to clean out all the expected characters in filenames, to find the odd chars, I discover filenames containing ‘, `, ~, ^, %, #, +, {, }, [, ], , | plus some files with totally Chinese names. Re-reading the ‘find’ man page, I rediscover “-print0” … rediscover in the sense that I’ve read about it before, but never used it. The man page says,

     -X  Permit find to be safely used in conjunction with xargs(1).  If a file
           name contains any of the delimiting characters used by xargs(1), a diag-
           nostic message is displayed on standard error, and the file is skipped.
           The delimiting characters include single (`` ' '') and double (`` " '')
           quotes, backslash (``\''), space, tab and newline characters.
    
           However, you may wish to consider the -print0 primary in conjunction with
          ``xargs -0'' as an effective alternative.
    

    -print0 uses null-terminated strings and -0 tells xargs to expect that input. Changing to:

    sudo find ~ -type f -print0 2> /dev/null | xargs -0 -n 1 dirname  | sort | uniq -c  | sort -n -r  | head
    

    produces much better results:

    $ sudo find /Users/ -type f -print0 2> /dev/null | xargs -0 -n 1 dirname |  uniq -c  | sort -n -r  | head
    
    17458 /Users//tomlegrady/Desktop/.../Mailboxes/Clubs/TLUG/2004.mbox/Messages
    

    and similar numbers for other directories.

    Hmm … time to delete that directory, haven’t looked in there in nine years.
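The tail of the pipeline in the last comment (dirname, sort, uniq -c, sort -n -r, head) can also be done in Perl by reading the NUL-terminated paths that find -print0 writes. This is a sketch under that assumption rather than anything the commenter posted; the names tally_dirs and report, and the "-" argument that switches on reading from standard input, are made up for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Tally how many NUL-terminated paths read from $fh fall in each directory.
sub tally_dirs {
    my ($fh) = @_;
    local $/ = "\0";               # records end with NUL, as find -print0 writes them
    my %count;
    while (defined(my $path = <$fh>)) {
        chomp $path;               # strips the trailing NUL because of $/
        $count{ dirname($path) }++;
    }
    return \%count;
}

# Print the $n directories with the most files, highest count first.
sub report {
    my ($count, $n) = @_;
    my @dirs = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    splice @dirs, $n if @dirs > $n;
    printf "%7d %s\n", $count->{$_}, $_ for @dirs;
}

# Read standard input only when asked to, with a lone "-" argument.
if (@ARGV and $ARGV[0] eq '-') {
    report(tally_dirs(\*STDIN), 10);
}
```

Assuming the script is saved as tally.pl, it would be run as `find ~ -type f -print0 2> /dev/null | perl tally.pl -`. Because the hash does the grouping, no sort is needed before counting, and quotes or spaces in filenames are not a problem.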
