From dbb52e5c345aeafd3b7a2f142ca6bf2039616574 Mon Sep 17 00:00:00 2001
From: Siraaj Khandkar
Date: Wed, 28 Nov 2018 17:12:13 -0500
Subject: [PATCH] Update the shell-equivalent implementation and motivation

---
 README.md | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 979eddc..04b84ec 100644
--- a/README.md
+++ b/README.md
@@ -6,16 +6,18 @@ having the same (and non-0) file size and MD5 hash digest.
 
 It is roughly equivalent to the following one-liner:
 ```sh
-find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
+find . -type f -print0 | xargs -0 -P 6 -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf " %s\n", paths[digest, i]} } } }'
 ```
 
 which, when indented, looks like:
 ```sh
-find . -type f -exec md5sum '{}' \; \
+find . -type f -print0 \
+| xargs -0 -P $(nproc) md5sum \
 | awk '
     {
         digest = $1
-        path = $2
+        sub("^" $1 " +", "")
+        path = $0
         paths[digest, ++count[digest]] = path
     }
 
@@ -25,15 +27,38 @@ find . -type f -exec md5sum '{}' \; \
             if (n > 1) {
                 print(digest, n)
                 for (i=1; i<=n; i++) {
-                    print " ", paths[digest, i]
+                    printf " %s\n", paths[digest, i]
                 }
             }
         }
     }'
 ```
 
-and works well-enough, until you start getting weird file paths that are more
-of a pain to handle quoting for than re-writing this thing in OCaml :)
+and works well-enough, but is painfully slow (for instance, it takes around 8
+minutes to process my home directory, whereas `dups` takes around 8 seconds).
+
+Originally, my main motivation for rewriting the above script in OCaml was
+simply to avoid dealing with file paths containing newlines and spaces (the
+original rewrite was substantially simpler than it currently is).
+
+I since realized that, on the _input_, the problem is avoided by delimiting the
+found paths with the null byte, rather than a newline and in AWK doing an
+ostensible `shift` of the `$0` field (`sub("^" $1 " +", "")`).
+
+However, on the _output_, I still don't know of a _simple_ way to escape the
+newline in AWK (in OCaml, there's the `%S` in `printf` and in GNU `printf`
+there's `%q`).
+
+In any case, I now have 2 other reasons to continue with this project:
+1. The speed-up is a boon to my UX (thanks in large part to optimizations
+   suggested by @Aeronotix);
+2. I plan to extend the feature set, which is just too-unpleasant to manage in
+   a string-oriented PL:
+    1. byte-by-byte comparison of files that hash to the same digest, to make
+       super-duper sure they are indeed the same and do not just happen to
+       collide;
+    2. extend the metrics reporting;
+    3. output sorting options.
 
 Example
 -------
-- 
2.20.1