dups
====

Find duplicate files in N given directory trees, where "duplicate" is defined
as having the same (non-zero) file size and the same MD5 hash digest.

It is roughly equivalent to the following one-liner (included as `dups.sh`):
```sh
find . -type f -print0 | xargs -0 -P $(nproc) -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf " %s\n", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -print0 \
| xargs -0 -P $(nproc) md5sum \
| awk '
    {
        digest = $1
        sub("^" $1 " +", "")
        path = $0
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    printf " %s\n", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, but is painfully slow (for instance, it takes around 8
minutes to process my home directory, whereas `dups` takes around 8 seconds).

Originally, my main motivation for rewriting the above script in OCaml was
simply to avoid dealing with file paths containing newlines and spaces (the
original rewrite was substantially simpler than the current version).

I have since realized that, on the _input_ side, the problem is avoided by
delimiting the found paths with the null byte rather than a newline and, in
AWK, by doing an ostensible `shift` of `$0` (`sub("^" $1 " +", "")`).
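
To see what the `sub` call buys over simply printing `$2`, which would cut a
path at its first space, here is a minimal sketch. `strip_digest` is a
hypothetical helper name, and the sample line only mimics the
`digest  path` shape of `md5sum` output (the digest is made up and shortened):

```sh
# Hypothetical helper: drop the leading digest field from each input line,
# keeping the rest of the line (inner runs of spaces included) intact.
strip_digest() {
    awk '{
        digest = $1
        sub("^" $1 " +", "")   # remove the first field and the blanks after it
        print $0               # what is left of the record is the path
    }'
}

# A path containing two consecutive spaces survives unchanged.
printf '%s\n' 'abc123  ./my  file.txt' | strip_digest
```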

However, on the _output_ side, I still don't know of a _simple_ way to escape
newlines in AWK (in OCaml, `printf` has `%S`, and GNU `printf` has `%q`).
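
For comparison, this is roughly what the `%q` route looks like from the shell.
It is a minimal sketch that assumes the `printf` found on `PATH` is the GNU
coreutils one (it is invoked through `env` to bypass the shell builtin, which
may not support `%q`); `quote_path` is a hypothetical helper name:

```sh
# Hypothetical helper: print a path shell-quoted on a single output line, so
# that a newline inside the path cannot be mistaken for a record separator.
# Assumption: `env printf` resolves to GNU coreutils printf, which has %q.
quote_path() {
    env printf '%q\n' "$1"
}

# A path with an embedded newline (command substitution keeps inner newlines).
tricky=$(printf './a\nb.txt')
quote_path "$tricky"
```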

In any case, I now have two other reasons to continue with this project:
1. The speed-up is a boon to my UX (thanks in large part to optimizations
   suggested by @Aeronotix);
2. I plan to extend the feature set, which is just too unpleasant to manage in
   a string-oriented PL:
    1. byte-by-byte comparison of files that hash to the same digest, to make
       super-duper sure they are indeed the same and do not just happen to
       collide;
    2. extended metrics reporting;
    3. output sorting options.
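
The byte-by-byte check in item 2.1 is what `cmp` does for a pair of files. The
following is a minimal shell sketch of that idea, not the planned OCaml
implementation; `same_bytes` is a hypothetical helper name:

```sh
# Hypothetical helper: succeed only if both files have identical contents.
# `cmp -s` compares byte by byte, prints nothing, and reports the result
# through its exit status.
same_bytes() {
    cmp -s "$1" "$2"
}

# Demo on throwaway files.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
printf 'hello\n' > "$tmp/b"
printf 'hellO\n' > "$tmp/c"
same_bytes "$tmp/a" "$tmp/b" && echo "a and b are identical"
same_bytes "$tmp/a" "$tmp/c" || echo "a and c differ"
rm -r "$tmp"
```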

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"
Time : 0.031084 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
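
The counters in this report add up: 45 hashed + 2 skipped for 0 size + 74
skipped for unique size = 121 considered. A file whose size is shared with no
other file cannot have a duplicate, so it never needs hashing. The sketch below
is one way to express such a pre-filter in shell; it is an illustration, not a
description of how `dups` implements it. It assumes GNU `find` (for `-printf`)
and paths without newlines, and `size_candidates` is a hypothetical name:

```sh
# Hypothetical helper: list non-empty regular files under "$1" whose size is
# shared with at least one other file (the only ones worth hashing).
# Assumptions: GNU find (-printf) and no newlines in paths.
size_candidates() {
    find "$1" -type f -size +0c -printf '%s %p\n' \
    | awk '
        {
            size = $1
            sub("^" $1 " ", "")          # keep the path, spaces included
            paths[size, ++count[size]] = $0
        }
        END {
            for (size in count)
                if (count[size] > 1)     # unique sizes are skipped
                    for (i = 1; i <= count[size]; i++)
                        print paths[size, i]
        }'
}
```

The output order follows AWK's array traversal, so pipe it through `sort` if a
stable order matters.
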
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"

$ ./dups . 1> /dev/null
Time : 0.070765 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
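
As an example of such downstream processing, the sketch below counts the
duplicate groups by matching the `digest count` header lines shown above.
`count_groups` is a hypothetical helper name, and the pattern assumes the
output format stays as in this example:

```sh
# Hypothetical helper: count duplicate groups on stdin by matching header
# lines of the form "<32 hex digits> <count>".
count_groups() {
    grep -c -E '^[0-9a-f]{32} [0-9]+$'
}

# Usage (the report lines on stderr are discarded):
#   ./dups . 2> /dev/null | count_groups
```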