X-Git-Url: https://git.xandkar.net/?p=dups.git;a=blobdiff_plain;f=README.md;h=04b84ecbbef4b1b95bef57749c4a0ae7a70172d4;hp=a6e6c54804ace51f30c790c11ec56ad992fbbdb5;hb=dbb52e5c345aeafd3b7a2f142ca6bf2039616574;hpb=b839d582481df4861b7bdf123f404dcf13ee5bbd

diff --git a/README.md b/README.md
index a6e6c54..04b84ec 100644
--- a/README.md
+++ b/README.md
@@ -2,35 +2,63 @@ dups
 ====
 
 Find duplicate files in given directory trees. Where "duplicate" is defined as
-having the same MD5 hash digest.
+having the same (and non-0) file size and MD5 hash digest.
 
 It is roughly equivalent to the following one-liner:
 ```sh
-find . -type f -exec md5sum '{}' \; | awk '{paths[$1, ++cnt[$1]] = $2} END {for (path in cnt) {n = cnt[path]; if (n > 1) {print(path, n); for (i=1; i<=n; i++) {print("    ", paths[path, i])} } } }'
+find . -type f -print0 | xargs -0 -P 6 -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf "    %s\n", paths[digest, i]} } } }'
 ```
 
 which, when indented, looks like:
 ```sh
-find . -type f -exec md5sum '{}' \; \
+find . -type f -print0 \
+| xargs -0 -P $(nproc) md5sum \
 | awk '
     {
-        paths[$1, ++cnt[$1]] = $2
+        digest = $1
+        sub("^" $1 " +", "")
+        path = $0
+        paths[digest, ++count[digest]] = path
     }
+
     END {
-        for (path in cnt) {
-            n = cnt[path]
+        for (digest in count) {
+            n = count[digest]
             if (n > 1) {
-                print(path, n)
+                print(digest, n)
                 for (i=1; i<=n; i++) {
-                    print("    ", paths[path, i])
+                    printf "    %s\n", paths[digest, i]
                 }
             }
         }
     }'
 ```
 
-and works well-enough, until you start getting weird file paths that are more
-of a pain to handle quoting for than re-writing this thing in OCaml :)
+and works well-enough, but is painfully slow (for instance, it takes around 8
+minutes to process my home directory, whereas `dups` takes around 8 seconds).
+
+Originally, my main motivation for rewriting the above script in OCaml was
+simply to avoid dealing with file paths containing newlines and spaces (the
+original rewrite was substantially simpler than it currently is).
+
+I since realized that, on the _input_, the problem is avoided by delimiting the
+found paths with the null byte, rather than a newline and in AWK doing an
+ostensible `shift` of the `$0` field (`sub("^" $1 " +", "")`).
+
+However, on the _output_, I still don't know of a _simple_ way to escape the
+newline in AWK (in OCaml, there's the `%S` in `printf` and in GNU `printf`
+there's `%q`).
+
+In any case, I now have 2 other reasons to continue with this project:
+1. The speed-up is a boon to my UX (thanks in large part to optimizations
+   suggested by @Aeronotix);
+2. I plan to extend the feature set, which is just too-unpleasant to manage in
+   a string-oriented PL:
+    1. byte-by-byte comparison of files that hash to the same digest, to make
+       super-duper sure they are indeed the same and do not just happen to
+       collide;
+    2. extend the metrics reporting;
+    3. output sorting options.
 
 Example
 -------
@@ -42,16 +70,38 @@
 Finished, 0 targets (0 cached) in 00:00:00.
 Finished, 5 targets (0 cached) in 00:00:00.
 $ ./dups .
-df4235f3da793b798095047810153c6b 2
+e40e3c4330857e2762d043427b499301 2
+    "./_build/dups.native"
+    "./dups"
+3d1c679e5621b8150f54d21f3ef6dcad 2
     "./_build/dups.ml"
     "./dups.ml"
-d41d8cd98f00b204e9800998ecf8427e 2
-    "./_build/dups.mli"
-    "./dups.mli"
-087809b180957ce812a39a5163554502 2
+Time                       : 0.031084 seconds
+Considered                 : 121
+Hashed                     : 45
+Skipped due to 0 size      : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
+```
+Note that the report lines are written to `stderr`, so that `stdout` is safely
+processable by other tools:
+
+```
+$ ./dups . 2> /dev/null
+e40e3c4330857e2762d043427b499301 2
     "./_build/dups.native"
     "./dups"
-Processed 102 files in 0.025761 seconds.
+3d1c679e5621b8150f54d21f3ef6dcad 2
+    "./_build/dups.ml"
+    "./dups.ml"
+
+$ ./dups . 1> /dev/null
+Time                       : 0.070765 seconds
+Considered                 : 121
+Hashed                     : 45
+Skipped due to 0 size      : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
 ```
-Note that the report line (`Processed 102 files in 0.025761 seconds.`) is
-written to `stderr`, so that `stdout` is safely processable by other tools.
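
The null-byte input trick that the patch describes (paired with the awk `sub`-based "shift" of `$0`) can be seen in isolation with a small sketch. The temp directory and file names below are hypothetical, and GNU `md5sum`/`xargs` are assumed:

```sh
# Hypothetical demo: two equal-content files whose names contain spaces.
dir=$(mktemp -d)
printf 'same' > "$dir/a file"
printf 'same' > "$dir/another file"

# -print0 / xargs -0 keep whitespace-laden paths intact as single
# arguments; sub() then strips the digest column so $0 is the bare path.
find "$dir" -type f -print0 \
| xargs -0 md5sum \
| awk '{digest = $1; sub("^" $1 " +", ""); print digest, "->", $0}'
```

Both output lines carry the same digest, with the spaces in each path preserved, which is exactly what a newline-delimited pipeline through plain `$2` would have mangled.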
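
The "Skipped due to 0 size" and "Skipped due to unique size" counters in the new report reflect the pre-filter implied by the revised definition of "duplicate" (same non-0 size before any hashing). A rough shell sketch of that pre-filter, assuming GNU `stat` and `find`, with hypothetical demo files:

```sh
# Hypothetical demo tree: one size-collision pair, one unique size,
# and one empty file.
dir=$(mktemp -d)
printf 'aa'  > "$dir/x"
printf 'bb'  > "$dir/y"
printf 'ccc' > "$dir/z"
: > "$dir/empty"

# Keep only non-empty files whose size occurs more than once; only
# these survivors would need to be hashed at all.
find "$dir" -type f -size +0c -print0 \
| xargs -0 stat -c '%s %n' \
| awk '{size = $1; sub("^" $1 " +", ""); n[size]++; p[size, n[size]] = $0}
       END {for (s in n) if (n[s] > 1) for (i = 1; i <= n[s]; i++) print p[s, i]}'
```

Only `x` and `y` are printed: `empty` is dropped by `-size +0c`, and `z` by the unique-size check, mirroring the two "Skipped" lines in the report.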
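
Item 1 on the patch's planned-feature list (byte-by-byte confirmation that same-digest files really are identical) can be approximated with POSIX `cmp`; a minimal sketch with hypothetical files:

```sh
# Two files with identical content: same size, same MD5 digest.
dir=$(mktemp -d)
printf 'same bytes' > "$dir/f1"
printf 'same bytes' > "$dir/f2"

# cmp -s exits 0 only when the operands are byte-for-byte identical,
# which rules out a (vanishingly unlikely) hash collision.
if cmp -s "$dir/f1" "$dir/f2"; then
    echo 'identical'   # prints: identical
else
    echo 'different'
fi
```

Since `cmp` stops at the first differing byte, running it only on groups that already share a size and digest adds little cost relative to the hashing pass.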