====
Find duplicate files in given directory trees. Where "duplicate" is defined as
-having the same MD5 hash digest.
+having the same (and non-0) file size and MD5 hash digest.
It is roughly equivalent to the following one-liner:
```sh
-find . -type f -exec md5sum '{}' \; | awk '{paths[$1, ++cnt[$1]] = $2} END {for (path in cnt) {n = cnt[path]; if (n > 1) {print(path, n); for (i=1; i<=n; i++) {print(" ", paths[path, i])} } } }'
+find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print " ", paths[digest, i]} } } }'
```
which, when indented, looks like:
find . -type f -exec md5sum '{}' \; \
| awk '
{
- paths[$1, ++cnt[$1]] = $2
+ digest = $1
+ path = $2
+ paths[digest, ++count[digest]] = path
}
+
END {
- for (path in cnt) {
- n = cnt[path]
+ for (digest in count) {
+ n = count[digest]
if (n > 1) {
- print(path, n)
+ print(digest, n)
for (i=1; i<=n; i++) {
- print(" ", paths[path, i])
+ print " ", paths[digest, i]
}
}
}
Finished, 5 targets (0 cached) in 00:00:00.
$ ./dups .
-df4235f3da793b798095047810153c6b 2
+e40e3c4330857e2762d043427b499301 2
+ "./_build/dups.native"
+ "./dups"
+3d1c679e5621b8150f54d21f3ef6dcad 2
"./_build/dups.ml"
"./dups.ml"
-d41d8cd98f00b204e9800998ecf8427e 2
- "./_build/dups.mli"
- "./dups.mli"
-087809b180957ce812a39a5163554502 2
+Time : 0.031084 seconds
+Considered : 121
+Hashed : 45
+Skipped due to 0 size : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
+```
+Note that the report lines are written to `stderr`, so that `stdout` is safely
+processable by other tools:
+
+```
+$ ./dups . 2> /dev/null
+e40e3c4330857e2762d043427b499301 2
"./_build/dups.native"
"./dups"
-Processed 102 files in 0.025761 seconds.
+3d1c679e5621b8150f54d21f3ef6dcad 2
+ "./_build/dups.ml"
+ "./dups.ml"
+
+$ ./dups . 1> /dev/null
+Time : 0.070765 seconds
+Considered : 121
+Hashed : 45
+Skipped due to 0 size : 2
+Skipped due to unique size : 74
+Ignored due to regex match : 0
+
```
-Note that the report line (`Processed 102 files in 0.025761 seconds.`) is
-written to `stderr`, so that `stdout` is safely processable by other tools.