Find duplicate files in given directory trees, where "duplicate" is defined as
having the same (non-zero) file size and MD5 hash digest.
It is roughly equivalent to the following one-liner:
    find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
which, when indented, looks like:
    find . -type f -exec md5sum '{}' \; \
    | awk '
        {
            digest = $1
            path = $2
            paths[digest, ++count[digest]] = path
        }

        END {
            for (digest in count) {
                n = count[digest]
                if (n > 1) {
                    print(digest, n)
                    for (i=1; i<=n; i++) {
                        print " ", paths[digest, i]
                    }
                }
            }
        }
    '
and works well enough, until you start getting weird file paths that are more
of a pain to handle quoting for than rewriting this thing in OCaml :)
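For a flavor of that rewrite, here is a minimal, hypothetical OCaml sketch of
the same idea, assuming the two-pass shape suggested by the report counters
below: group paths by file size first, then MD5-hash only the files whose size
collides with another's. All names here (`walk`, `push`, `file_size`) are
illustrative; this is not the actual `dups` source:

    (* dedup_sketch.ml -- stdlib only; run with: ocaml dedup_sketch.ml *)

    (* Recursively collect the file paths under [root]. A real version
       would also guard against unreadable files, symlink loops, etc. *)
    let rec walk root =
      if Sys.is_directory root then
        Sys.readdir root
        |> Array.to_list
        |> List.concat_map (fun name -> walk (Filename.concat root name))
      else [root]

    (* Prepend [v] onto the list stored under [key]. *)
    let push tbl key v =
      let vs = Option.value (Hashtbl.find_opt tbl key) ~default:[] in
      Hashtbl.replace tbl key (v :: vs)

    let file_size path =
      let ic = open_in_bin path in
      let size = in_channel_length ic in
      close_in ic;
      size

    let () =
      (* Pass 1: group paths by size, dropping 0-sized files. *)
      let by_size = Hashtbl.create 64 in
      List.iter
        (fun path ->
           let size = file_size path in
           if size > 0 then push by_size size path)
        (walk ".");
      (* Pass 2: a file with a unique size cannot be a duplicate, so
         hash only the size collisions, regrouping by MD5 digest. *)
      let by_digest = Hashtbl.create 64 in
      Hashtbl.iter
        (fun _size paths ->
           if List.length paths > 1 then
             List.iter
               (fun p -> push by_digest (Digest.to_hex (Digest.file p)) p)
               paths)
        by_size;
      (* Report every digest shared by more than one path. *)
      Hashtbl.iter
        (fun digest paths ->
           let n = List.length paths in
           if n > 1 then begin
             Printf.printf "%s %d\n" digest n;
             List.iter (Printf.printf "    %S\n") paths
           end)
        by_digest

Note that quoting stops being a concern: paths stay ordinary OCaml strings end
to end, and `%S` escapes them on output.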
After building, run `dups` on the current directory tree:
    Finished, 0 targets (0 cached) in 00:00:00.
    Finished, 5 targets (0 cached) in 00:00:00.

    $ ./dups .
    e40e3c4330857e2762d043427b499301 2
        "./_build/dups.native"
    3d1c679e5621b8150f54d21f3ef6dcad 2
    Time                       : 0.031084 seconds
    Skipped due to 0 size      : 2
    Skipped due to unique size : 74
    Ignored due to regex match : 0
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:
    $ ./dups . 2> /dev/null
    e40e3c4330857e2762d043427b499301 2
        "./_build/dups.native"
    3d1c679e5621b8150f54d21f3ef6dcad 2
    $ ./dups . 1> /dev/null
    Time                       : 0.070765 seconds
    Skipped due to 0 size      : 2
    Skipped due to unique size : 74
    Ignored due to regex match : 0
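Since `stdout` carries only the digest/path report, it pipes straight into
other tools. For example (an illustrative pipeline, relying on the fact that
digest lines start with a hex digit while path lines are indented), counting
the duplicate groups from the run above:

    $ ./dups . 2> /dev/null | grep -c '^[0-9a-f]'
    2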