Find duplicate files in N given directory trees, where "duplicate" is defined
as having the same (and non-0) file size and MD5 hash digest.
It is roughly equivalent to the following one-liner (included as `dups.sh`):
    find . -type f -print0 | xargs -0 -P $(nproc) md5sum | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf "    %s\n", paths[digest, i]} } } }'
which, when indented, looks like:
    find . -type f -print0 \
    | xargs -0 -P $(nproc) md5sum \
    | awk '
        {
            digest = $1
            sub("^" $1 " +", "")
            path = $0
            paths[digest, ++count[digest]] = path
        }

        END {
            for (digest in count) {
                n = count[digest]
                if (n > 1) {
                    print(digest, n)
                    for (i=1; i<=n; i++) {
                        printf "    %s\n", paths[digest, i]
                    }
                }
            }
        }'
and works well enough, but is painfully slow (for instance, it takes around 8
minutes to process my home directory, whereas `dups` takes around 8 seconds).
Originally, my main motivation for rewriting the above script in OCaml was
simply to avoid dealing with file paths containing newlines and spaces (the
original rewrite was substantially simpler than it currently is).
I have since realized that, on the _input_, the problem is avoided by
delimiting the found paths with the null byte rather than a newline, and, in
AWK, by doing an ostensible `shift` of the `$0` field (`sub("^" $1 " +", "")`).
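The ostensible `shift` can be seen in isolation in a small sketch (the digest
and path below are made up for illustration):

```sh
# Strip the first field (the digest) plus the following spaces from $0,
# so that the remainder of the record is the path, spaces and all.
printf 'd41d8cd98f00b204e9800998ecf8427e  ./a file with spaces\n' \
| awk '{digest = $1; sub("^" $1 " +", ""); print digest "|" $0}'
```

Because `sub()` rewrites `$0` rather than splitting it, embedded spaces in the
path survive intact.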
However, on the _output_, I still don't know of a _simple_ way to escape the
newline in AWK (in OCaml, there's the `%S` conversion in `printf`, and GNU
`printf` has `%q`).
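As an aside (this is outside AWK and not part of `dups` itself), GNU coreutils
`printf(1)`'s `%q` conversion shows the kind of escaping meant here, rendering
a newline-bearing path as a single shell-quoted line:

```sh
# %q (GNU coreutils printf) escapes the argument for shell reuse, so the
# embedded newline prints as the two characters '\' 'n' instead of
# breaking the output line. `env` forces the external printf binary.
env printf '%q\n' "$(printf 'evil\nname')"
```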
In any case, I now have 2 other reasons to continue with this project:
1. The speed-up is a boon to my UX (thanks in large part to optimizations
   suggested by @Aeronotix);
2. I plan to extend the feature set, which is just too unpleasant to manage in
   a shell script:
   1. byte-by-byte comparison of files that hash to the same digest, to make
      super-duper sure they are indeed the same and do not just happen to
      collide;
   2. extend the metrics reporting;
   3. output sorting options.
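The byte-by-byte confirmation mentioned in item 1 could be sketched with
`cmp` (the temporary files below are hypothetical stand-ins for two candidate
duplicates):

```sh
# cmp -s exits 0 only if the two files are byte-for-byte identical,
# catching the (astronomically rare) case of an MD5 collision.
a=$(mktemp); b=$(mktemp)
printf 'same contents\n' > "$a"
printf 'same contents\n' > "$b"
if cmp -s "$a" "$b"; then
    echo "identical"
else
    echo "digest collision, not a duplicate"
fi
rm -f "$a" "$b"
```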
After building, run `dups` on the current directory tree:
    $ make
    Finished, 0 targets (0 cached) in 00:00:00.
    Finished, 5 targets (0 cached) in 00:00:00.

    $ ./dups .
    e40e3c4330857e2762d043427b499301 2
        "./_build/dups.native"
    3d1c679e5621b8150f54d21f3ef6dcad 2
    Time                       : 0.031084 seconds
    Skipped due to 0 size      : 2
    Skipped due to unique size : 74
    Ignored due to regex match : 0
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:
    $ ./dups . 2> /dev/null
    e40e3c4330857e2762d043427b499301 2
        "./_build/dups.native"
    3d1c679e5621b8150f54d21f3ef6dcad 2
    $ ./dups . 1> /dev/null
    Time                       : 0.070765 seconds
    Skipped due to 0 size      : 2
    Skipped due to unique size : 74
    Ignored due to regex match : 0