Update the shell-equivalent implementation and motivation

diff --git a/README.md b/README.md
index a6e6c54..04b84ec 100644
--- a/README.md
+++ b/README.md
@@ -2,35 +2,63 @@ dups
  ====
  
  Find duplicate files in given directory trees. Where "duplicate" is defined as
-having the same MD5 hash digest.
+having the same (and non-0) file size and MD5 hash digest.
  
  It is roughly equivalent to the following one-liner:
  ```sh
-find . -type f -exec md5sum '{}' \; | awk '{paths[$1, ++cnt[$1]] = $2} END {for (path in cnt) {n = cnt[path]; if (n > 1) {print(path, n); for (i=1; i<=n; i++) {print("    ", paths[path, i])} } } }'
+find . -type f -print0 | xargs -0 -P 6 -I % md5sum % | awk '{digest = $1;  sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf "    %s\n", paths[digest, i]} } } }'
  ```
  
  which, when indented, looks like:
  ```sh
-find . -type f -exec md5sum '{}' \; \
+find . -type f -print0 \
+| xargs -0 -P $(nproc) md5sum \
  | awk '
      {
-        paths[$1, ++cnt[$1]] = $2
+        digest = $1
+        sub("^" $1 " +", "")
+        path = $0
+        paths[digest, ++count[digest]] = path
      }
+
      END {
-        for (path in cnt) {
-            n = cnt[path]
+        for (digest in count) {
+            n = count[digest]
              if (n > 1) {
-                print(path, n)
+                print(digest, n)
                  for (i=1; i<=n; i++) {
-                    print("    ", paths[path, i])
+                    printf "    %s\n", paths[digest, i]
                  }
              }
          }
      }'
  ```
  
-and works well-enough, until you start getting weird file paths that are more
-of a pain to handle quoting for than re-writing this thing in OCaml :)
+and works well enough, but is painfully slow (for instance, it takes around 8
+minutes to process my home directory, whereas `dups` takes around 8 seconds).
+
+Originally, my main motivation for rewriting the above script in OCaml was
+simply to avoid dealing with file paths containing newlines and spaces (the
+original rewrite was substantially simpler than it currently is).
+
+I have since realized that, on the _input_, the problem is avoided by
+delimiting the found paths with the null byte rather than a newline and, in
+AWK, doing an ostensible `shift` of the `$0` field (`sub("^" $1 " +", "")`).
+
+However, on the _output_, I still don't know of a _simple_ way to escape the
+newline in AWK (in OCaml, there's the `%S` in `printf` and in GNU `printf`
+there's `%q`).
+
+In any case, I now have 2 other reasons to continue with this project:
+1. The speed-up is a boon to my UX (thanks in large part to optimizations
+   suggested by @Aeronotix);
+2. I plan to extend the feature set, which is just too unpleasant to manage in
+   a string-oriented language:
+    1. byte-by-byte comparison of files that hash to the same digest, to make
+       super-duper sure they are indeed the same and do not just happen to
+       collide;
+    2. extend the metrics reporting;
+    3. output sorting options.
  
  Example
  -------
@@ -42,16 +70,38 @@ Finished, 0 targets (0 cached) in 00:00:00.
  Finished, 5 targets (0 cached) in 00:00:00.
  
  $ ./dups .
-df4235f3da793b798095047810153c6b 2
+e40e3c4330857e2762d043427b499301 2
+    "./_build/dups.native"
+    "./dups"
+3d1c679e5621b8150f54d21f3ef6dcad 2
      "./_build/dups.ml"
      "./dups.ml"
-d41d8cd98f00b204e9800998ecf8427e 2
-    "./_build/dups.mli"
-    "./dups.mli"
-087809b180957ce812a39a5163554502 2
+Time                       : 0.031084 seconds
+Considered                 : 121
+Hashed                     : 45
+Skipped due to 0      size : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
+```
+Note that the report lines are written to `stderr`, so that `stdout` is safely
+processable by other tools:
+
+```
+$ ./dups . 2> /dev/null
+e40e3c4330857e2762d043427b499301 2
      "./_build/dups.native"
      "./dups"
-Processed 102 files in 0.025761 seconds.
+3d1c679e5621b8150f54d21f3ef6dcad 2
+    "./_build/dups.ml"
+    "./dups.ml"
+
+$ ./dups . 1> /dev/null
+Time                       : 0.070765 seconds
+Considered                 : 121
+Hashed                     : 45
+Skipped due to 0      size : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
  ```
-Note that the report line (`Processed 102 files in 0.025761 seconds.`) is
-written to `stderr`, so that `stdout` is safely processable by other tools.