dups
====

Find duplicate files in given directory trees. Where "duplicate" is defined as
having the same MD5 hash digest.

It is roughly equivalent to the following one-liner:
```sh
find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print "    ", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -exec md5sum '{}' \; \
| awk '
    {
        digest = $1
        path = $2
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    print "    ", paths[digest, i]
                }
            }
        }
    }'
```

and works well-enough, until you start getting weird file paths that are more
of a pain to handle quoting for than re-writing this thing in OCaml :)

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
df4235f3da793b798095047810153c6b 2
    "./_build/dups.ml"
    "./dups.ml"
d41d8cd98f00b204e9800998ecf8427e 2
    "./_build/dups.mli"
    "./dups.mli"
087809b180957ce812a39a5163554502 2
    "./_build/dups.native"
    "./dups"
Processed 102 files in 0.025761 seconds.
```
Note that the report line (`Processed 102 files in 0.025761 seconds.`) is
written to `stderr`, so that `stdout` is safely processable by other tools.