dups
====

Find duplicate files in N given directory trees, where "duplicate" is defined
as having the same (non-zero) file size and the same MD5 hash digest.

It is roughly equivalent to the following one-liner (included as `dups.sh`):
```sh
find . -type f -print0 | xargs -0 -P $(nproc) -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf " %s\n", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -print0 \
| xargs -0 -P $(nproc) md5sum \
| awk '
    {
        digest = $1
        sub("^" $1 " +", "")
        path = $0
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    printf " %s\n", paths[digest, i]
                }
            }
        }
    }'
```

The one-liner works well enough, but is painfully slow: for instance, it takes
around 8 minutes to process my home directory, whereas `dups` takes around 8
seconds.

Originally, my main motivation for rewriting the above script in OCaml was
simply to avoid dealing with file paths containing newlines and spaces (the
original rewrite was substantially simpler than the current code).

I have since realized that, on the _input_ side, the problem is avoided by
delimiting the found paths with the null byte rather than a newline, and by
doing in AWK the equivalent of a `shift` on `$0` (`sub("^" $1 " +", "")`),
which strips the leading digest and leaves the rest of the path intact.
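
As a minimal sketch of why the delimiter matters, the following counts the same two files once with newline-delimited and once with null-delimited output. The directory and file names are made up for illustration, and `-print0`/`-0` are assumed to be available, as they are in the pipeline above:

```sh
# Hypothetical demo directory; the file names exist only for this illustration.
dir=$(mktemp -d)
touch "$dir/with space" "$dir/with
newline"

# Newline-delimited: the embedded newline makes 2 files look like 3 records.
by_newline=$(find "$dir" -type f | wc -l | tr -d ' ')

# Null-delimited: each path reaches the command as exactly one argument.
by_null=$(find "$dir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "newline-delimited records: $by_newline"
echo "null-delimited arguments:  $by_null"
rm -rf "$dir"
```

With the two hypothetical files above, the newline-delimited count comes out as 3 and the null-delimited one as 2.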

However, on the _output_ side, I still don't know of a _simple_ way to escape
a newline in AWK (OCaml's `printf` has `%S`, and GNU `printf` has `%q`).
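
For reference, here is a small sketch of the `%q` route. It assumes the GNU coreutils `printf` is installed (it is invoked through `env` so that no shell builtin is used instead), and the path is a made-up example:

```sh
# Hypothetical path containing both a space and a newline.
path='dir/with space
and newline'

# `env` bypasses any shell builtin, so the external (assumed GNU) printf runs.
# %q emits a shell-quoted form of the argument with the newline escaped.
quoted=$(env printf '%q' "$path")

printf '%s\n' "$quoted"
```

The exact quoting style is chosen by `printf`, so it is not reproduced here; the point is only that the escaped path fits on a single output line.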

In any case, I now have two other reasons to continue with this project:
1. The speed-up is a boon to my UX (thanks in large part to optimizations
   suggested by @Aeronotix);
2. I plan to extend the feature set, which is just too unpleasant to manage in
   a string-oriented programming language:
    1. byte-by-byte comparison of files that hash to the same digest, to make
       super-duper sure they are indeed the same and do not just happen to
       collide;
    2. extended metrics reporting;
    3. output sorting options.
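
The planned byte-by-byte check is meant for `dups` itself, but the idea can be illustrated from the shell with `cmp`. This is only a sketch: `same_bytes` and the sample files are hypothetical, and nothing here reflects how the feature will actually be implemented.

```sh
dir=$(mktemp -d)
printf 'same\n' > "$dir/a"
printf 'same\n' > "$dir/b"   # identical to a
printf 'diff\n' > "$dir/c"   # same size as a, different bytes

# same_bytes is a hypothetical helper: `cmp -s` prints nothing and reports
# through its exit status whether the two files are byte-for-byte equal.
same_bytes() { cmp -s "$1" "$2"; }

if same_bytes "$dir/a" "$dir/b"; then ab=identical; else ab=different; fi
if same_bytes "$dir/a" "$dir/c"; then ac=identical; else ac=different; fi

echo "a vs b: $ab"
echo "a vs c: $ac"
rm -rf "$dir"
```

Running such a comparison only on files that already share a digest would keep the extra I/O limited to the reported groups.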

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"
Time : 0.031084 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
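
The counters in the report follow from the definition of a duplicate: in this run, 2 + 74 of the 121 considered files were skipped on size alone, and only the remaining 45 were hashed. A rough shell sketch of that order of checks for a single pair of files is shown below. `is_dup`, `classify`, and the sample files are hypothetical, `md5sum` is assumed to be installed, and this is not how `dups` itself is implemented:

```sh
# is_dup is a hypothetical helper: it succeeds only if both files are
# non-empty, have the same size, and have the same MD5 digest. The cheap
# checks come first, so md5sum only runs when the sizes already match.
is_dup() {
    [ -s "$1" ] && [ -s "$2" ] || return 1
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# classify is also hypothetical: it turns the exit status into a word.
classify() { if is_dup "$1" "$2"; then echo duplicate; else echo distinct; fi; }

dir=$(mktemp -d)
printf 'hello\n' > "$dir/a"
printf 'hello\n' > "$dir/b"        # same size and content as a
printf 'world\n' > "$dir/c"        # same size as a, different content
printf 'longer text\n' > "$dir/d"  # size differs from a: rejected before hashing
: > "$dir/e"                       # 0 size: rejected outright
: > "$dir/f"                       # 0 size: rejected outright

ab=$(classify "$dir/a" "$dir/b")
ac=$(classify "$dir/a" "$dir/c")
ad=$(classify "$dir/a" "$dir/d")
ef=$(classify "$dir/e" "$dir/f")

echo "a/b: $ab, a/c: $ac, a/d: $ad, e/f: $ef"
rm -rf "$dir"
```

Only the `a`/`b` pair is classified as a duplicate; the two empty files are not, even though their contents trivially match.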
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```sh
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"

$ ./dups . 1> /dev/null
Time : 0.070765 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
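
As an example of such downstream processing, the sketch below summarizes the `stdout` stream with AWK. `summarize` is a hypothetical helper, and it assumes that each group starts with a `digest count` line as in the output above. The sample text is copied from that output so the snippet can run without `dups`:

```sh
# Sample stdout copied from the example above; in practice, pipe
# `./dups . 2> /dev/null` into `summarize` instead.
sample='e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"'

# summarize is a hypothetical helper: it counts the groups and the copies
# that could be removed while keeping one file per group.
summarize() {
    awk '
        # Group header: a 32-character hex digest followed by a count.
        NF == 2 && length($1) == 32 && $1 ~ /^[0-9a-f]+$/ {
            groups++
            redundant += $2 - 1
        }
        END { printf "%d groups, %d redundant copies\n", groups, redundant }
    '
}

summary=$(printf '%s\n' "$sample" | summarize)
echo "$summary"
```

For the sample above this prints `2 groups, 2 redundant copies`.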