dups
====

Find duplicate files in the given directory trees, where "duplicate" means
having the same non-zero file size and the same MD5 hash digest.

It is roughly equivalent to the following one-liner:
```sh
find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -exec md5sum '{}' \; \
| awk '
    # Group paths by digest.
    {
        digest = $1
        path = $2
        paths[digest, ++count[digest]] = path
    }

    # Report only digests seen more than once.
    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    print " ", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, until you start getting weird file paths that are more
of a pain to quote correctly than rewriting this thing in OCaml :)
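
To make the path problem concrete, here is a small sketch of one way the one-liner goes wrong: `path = $2` keeps only the first whitespace-separated word of the path. The helper names `path_naive` and `path_whole` are hypothetical and not part of `dups`; the sample line assumes the usual `md5sum` layout of a digest, two separator characters, then the path. Paths containing newlines would still need more care than this.

```sh
# Hypothetical helpers: extract the path from one line of md5sum-style
# output, "<digest>  <path>", read from stdin.

# Naive: take the second whitespace-separated field, as the one-liner does.
path_naive() {
    awk '{ print $2 }'
}

# More careful: drop the digest and the two separator characters, keep the rest.
path_whole() {
    awk '{ print substr($0, length($1) + 3) }'
}

line='d41d8cd98f00b204e9800998ecf8427e  ./my file.txt'
printf '%s\n' "$line" | path_naive   # prints: ./my
printf '%s\n' "$line" | path_whole   # prints: ./my file.txt
```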

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"
Time : 0.031084 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
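
The counters in the report suggest that files are filtered by size before anything is hashed: empty files and files with a unique size are skipped, and only the rest are hashed (121 = 45 + 2 + 74 in the run above). The following is a rough shell sketch of such a prefilter, under that assumption; it is not the actual OCaml implementation, `hash_candidates` is a hypothetical name, and paths containing newlines are not handled.

```sh
# Hypothetical sketch (not the dups source): print the files under "$1" that
# would still need hashing, i.e. non-empty files that share their size with
# at least one other file.
# Usage: hash_candidates .
hash_candidates() {
    find "$1" -type f -exec wc -c '{}' \; \
    | awk '
        {
            size = $1
            path = $0
            sub(/^[ \t]*[0-9]+[ \t]/, "", path)   # strip the size column
            if (size + 0 == 0) next               # skipped due to 0 size
            paths[size, ++count[size]] = path
        }
        END {
            for (size in count)
                if (count[size] > 1)              # otherwise: unique size
                    for (i = 1; i <= count[size]; i++)
                        print paths[size, i]
        }'
}
```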
Note that the statistics report is written to `stderr`, so that `stdout` can be
safely processed by other tools:

```sh
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"

$ ./dups . 1> /dev/null
Time : 0.070765 seconds
Considered : 121
Hashed : 45
Skipped due to 0 size : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
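
Because only the duplicate listing reaches `stdout`, it can be piped straight into other tools. As a small sketch, assuming the format shown above (each group starts with a line holding a 32-character hex digest followed by a count; the exact whitespace is an assumption), the hypothetical helper `count_groups` counts the duplicate groups:

```sh
# Hypothetical helper: count group header lines ("<md5 digest> <count>")
# in dups-style output read from stdin.
count_groups() {
    grep -c -E '^[0-9a-f]{32}[[:space:]]+[0-9]+$'
}

# Sample input copied from the listing above; in practice:
#   ./dups . 2> /dev/null | count_groups
count_groups <<'EOF'
e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"
EOF
```

For the sample input this prints `2`, one per digest line; the quoted path lines never match because they start with `"`.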