dups
====

Find duplicate files in given directory trees, where "duplicate" is defined as
having the same MD5 hash digest.

It is roughly equivalent to the following one-liner:
```sh
find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -exec md5sum '{}' \; \
| awk '
    {
        digest = $1
        path = $2
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    print " ", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, until you start getting weird file paths that are more
of a pain to handle quoting for than rewriting this thing in OCaml :)
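
To illustrate the quoting pain: `path = $2` keeps only the first
whitespace-separated word of a path. A partial workaround, sketched here
assuming GNU `md5sum` output (digest, two separator characters, then the path),
is to take everything after the digest instead. The function name `find_dups`
is hypothetical, and paths containing backslashes or newlines are still
mishandled, because `md5sum` escapes those.

```sh
# Sketch: same idea as the pipeline above, but keeps the whole path.
# Assumes GNU md5sum line format: "<digest>  <path>".
find_dups() {
    find "$1" -type f -exec md5sum '{}' \; \
    | awk '
        {
            digest = $1
            # Everything after the digest and the 2-character separator.
            path = substr($0, length(digest) + 3)
            paths[digest, ++count[digest]] = path
        }

        END {
            for (digest in count) {
                n = count[digest]
                if (n > 1) {
                    print digest, n
                    for (i = 1; i <= n; i++) {
                        print "    " paths[digest, i]
                    }
                }
            }
        }'
}
```

Calling `find_dups .` would then report paths such as `./my file.txt` intact
rather than truncated at the first space.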

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
df4235f3da793b798095047810153c6b 2
    "./_build/dups.ml"
    "./dups.ml"
d41d8cd98f00b204e9800998ecf8427e 2
    "./_build/dups.mli"
    "./dups.mli"
087809b180957ce812a39a5163554502 2
    "./_build/dups.native"
    "./dups"
Processed 102 files in 0.025761 seconds.
```
Note that the report line (`Processed 102 files in 0.025761 seconds.`) is
written to `stderr`, so that `stdout` is safely processable by other tools.
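
As a small illustration of that separation, the sketch below counts duplicate
groups in a report read from stdin. The helper name `count_groups` is
hypothetical, and it assumes each group starts with a line of the form
`<32 hex digits> <count>`, as in the example output above.

```sh
# Count duplicate groups in a dups report read from stdin.
# Assumes group header lines look like "<32 hex digits> <count>".
count_groups() {
    grep -c -E '^[0-9a-f]{32} [0-9]+$'
}

# Hypothetical usage (requires a built ./dups); the report line goes to
# stderr, so it never reaches the pipe:
#   ./dups . 2>/dev/null | count_groups
```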