dups
====

Find duplicate files in given directory trees, where "duplicate" is defined as
having the same (non-zero) file size and MD5 hash digest.

It is roughly equivalent to the following one-liner:
```sh
find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -exec md5sum '{}' \; \
| awk '
    {
        digest = $1
        path = $2
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    print " ", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, until you start getting weird file paths that are more
of a pain to handle quoting for than re-writing this thing in OCaml :)
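
One concrete case of the path problem: `path = $2` keeps only the text up to
the first space, so `./my file.txt` is reported as `./my`. The following is a
minimal sketch of a variant that survives spaces, assuming GNU `md5sum` output
(32 hex digits, a 2-character separator, then the path). The name `dups_sketch`
is hypothetical and not part of `dups`; `-size +0` is added here only to mirror
the "non-zero size" rule.

```sh
# Hypothetical helper (not part of dups): list duplicate files under a directory.
# Assumes GNU md5sum output: 32 hex digits, 2 separator characters, then the path.
dups_sketch() {
    find "$1" -type f -size +0 -exec md5sum '{}' + \
    | awk '
    {
        digest = $1
        # Take everything after the separator, so spaces in the path are kept
        path = substr($0, 35)
        paths[digest, ++count[digest]] = path
    }
    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print digest, n
                for (i = 1; i <= n; i++) {
                    print "    " paths[digest, i]
                }
            }
        }
    }'
}
```

Paths containing newlines or backslashes are still mishandled, since GNU
`md5sum` escapes them and marks the line with a leading backslash, which is the
kind of quoting pain mentioned above.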

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"
Time : 0.031084 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
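
The counters in this example add up as 121 considered - 2 empty - 74 with a
unique size - 0 ignored = 45 hashed, which is consistent with a size check
running before any hashing. The following is a sketch of such a prefilter in
shell. It illustrates the idea only and is not how `dups` itself is written;
the name `hash_candidates` is hypothetical.

```sh
# Hypothetical helper (not part of dups): print only the files worth hashing,
# i.e. non-empty files whose byte size is shared with at least one other file.
# Paths containing newlines are not handled.
hash_candidates() {
    find "$1" -type f -size +0 -exec sh -c '
        for f; do
            # Command substitution left unquoted so any padding from wc is dropped
            printf "%s\t%s\n" $(wc -c < "$f") "$f"
        done' sh '{}' + \
    | awk -F '\t' '
    {
        size = $1
        # Everything after the first tab is the path
        path = substr($0, length(size) + 2)
        paths[size, ++count[size]] = path
    }
    END {
        for (size in count)
            if (count[size] > 1)
                for (i = 1; i <= count[size]; i++)
                    print paths[size, i]
    }'
}
```

Only the paths printed by this step would then need an MD5 digest, since a file
with a unique size cannot have a duplicate.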
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"

$ ./dups . 1> /dev/null
Time : 0.070765 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
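
As an illustration of why the split matters, here is a small sketch that counts
duplicate groups from `stdout` alone. `fake_dups` is a hypothetical stand-in
that replays part of the example output above so the snippet runs without the
real binary, and `count_groups` is likewise an assumed helper name.

```sh
# Hypothetical stand-in for ./dups: replays part of the example output,
# duplicates on stdout and the report on stderr.
fake_dups() {
    printf '%s\n' \
        'e40e3c4330857e2762d043427b499301 2' \
        '    "./_build/dups.native"' \
        '    "./dups"' \
        '3d1c679e5621b8150f54d21f3ef6dcad 2' \
        '    "./_build/dups.ml"' \
        '    "./dups.ml"'
    printf '%s\n' 'Considered : 121' 'Hashed : 45' >&2
}

# Count duplicate groups: group header lines start with a 32-digit hex digest.
count_groups() {
    grep -c '^[0-9a-f]\{32\} '
}

# The report never reaches the pipeline, so no filtering of it is needed.
fake_dups 2> /dev/null | count_groups
```

With the real tool, the last line would presumably become
`./dups . 2> /dev/null | count_groups`.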