dups
====

Find duplicate files in N given directory trees, where "duplicate" is defined
as having the same (and non-0) file size and MD5 hash digest.

It is roughly equivalent to the following one-liner (included as `dups.sh`):
```sh
find . -type f -print0 | xargs -0 -P $(nproc) -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf " %s\n", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -print0 \
| xargs -0 -P $(nproc) md5sum \
| awk '
    {
        digest = $1
        sub("^" $1 " +", "")
        path = $0
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    printf " %s\n", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, but is painfully slow (for instance, it takes around 8
minutes to process my home directory, whereas `dups` takes around 8 seconds).

Originally, my main motivation for rewriting the above script in OCaml was
simply to avoid dealing with file paths containing newlines and spaces (the
original rewrite was substantially simpler than it currently is).

I have since realized that, on the _input_ side, the problem is avoided by
delimiting the found paths with the null byte rather than a newline, and by
doing an ostensible `shift` of the `$0` field in AWK (`sub("^" $1 " +", "")`).
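
As an illustration of that `shift` trick in isolation, here is a minimal
sketch (the digest and the space-containing file name are made-up examples):

```sh
# md5sum prints "<digest>  <path>"; stripping the first field from $0
# recovers the whole path, even when it contains spaces.
echo 'd41d8cd98f00b204e9800998ecf8427e  ./a b.txt' \
| awk '{ sub("^" $1 " +", ""); print "[" $0 "]" }'
# prints: [./a b.txt]
```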

However, on the _output_ side, I still don't know of a _simple_ way to escape
the newline in AWK (in OCaml there is `%S` in `printf`, and GNU `printf` has
`%q`).
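
For reference, this is roughly what that `printf` escape looks like; a minimal
sketch, using ANSI-C quoting to construct a made-up path with an embedded
newline:

```sh
# %q renders the argument in a form that can be reused as shell input.
printf '%q\n' $'./a\nb.txt'
# with Bash's built-in printf this prints: $'./a\nb.txt'
```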
In any case, I now have 2 other reasons to continue with this project:
1. The speed-up is a boon to my UX (thanks in large part to optimizations
   suggested by @Aeronotix);
2. I plan to extend the feature set, which is just too unpleasant to manage in
   a string-oriented PL:
    1. byte-by-byte comparison of files that hash to the same digest (see the
       sketch after this list), to make super-duper sure they are indeed the
       same and do not just happen to collide;
    2. extend the metrics reporting;
    3. output sorting options.
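
For that byte-by-byte check, the underlying idea can be sketched in shell with
`cmp` (the file pair below is just an illustrative example taken from the
output shown in the next section):

```sh
# Given two files reported under the same digest, compare their contents
# directly; `cmp -s` is silent and exits 0 only when the files are identical.
if cmp -s "./_build/dups.ml" "./dups.ml"; then
    echo "identical"
else
    echo "same digest, different bytes"
fi
```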

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
77 "./_build/dups.ml"
78 "./dups.ml"
Time                       : 0.031084 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
93 "./_build/dups.native"
94 "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"

$ ./dups . 1> /dev/null
Time                       : 0.070765 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
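
Since only the duplicate listing goes to `stdout`, it can be piped straight
into other tools; for instance, this (made-up) pipeline counts the reported
duplicate groups by keeping only the non-indented digest lines:

```sh
./dups . 2> /dev/null | grep -c '^[^ ]'
```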