dups
====

Find duplicate files in the given directory trees, where "duplicate" means
having the same (non-zero) file size and MD5 hash digest.

It is roughly equivalent to the following one-liner:
```sh
find . -type f -exec md5sum '{}' \; | awk '{digest = $1; path = $2; paths[digest, ++count[digest]] = path} END {for (digest in count) {n = count[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {print "    ", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -exec md5sum '{}' \; \
| awk '
    {
        digest = $1
        path = $2
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    print "    ", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, until you start getting weird file paths, which are more
of a pain to quote correctly than rewriting this thing in OCaml :)

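To make the "weird file paths" problem concrete, here is a minimal sketch. The
`line` value is a made-up `md5sum`-style line (not output from a real run), and
`naive_path` and `full_path` are hypothetical helpers, not part of `dups`:

```sh
# Hypothetical md5sum-style line for a file whose name contains a space.
line='d41d8cd98f00b204e9800998ecf8427e  ./my file.txt'

# Same field split as the one-liner: awk splits on whitespace,
# so $2 stops at the first space inside the path.
naive_path() { printf '%s\n' "$1" | awk '{print $2}'; }

# One possible workaround: skip the 32-character digest and the
# 2-character separator, keeping the rest of the line.
full_path() { printf '%s\n' "$1" | cut -c35-; }

naive_path "$line"   # prints ./my
full_path "$line"    # prints ./my file.txt
```

Even the `cut` variant only handles spaces. Names containing newlines or
backslashes would still need more quoting work, which is the pain referred to
above.
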
Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"
Time                       : 0.031084 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
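
As a quick sanity check on the report, the per-category counts in this sample
run add up to the `Considered` total. The variable names below are made up for
illustration, and whether this sum holds for every run (in particular, whether
regex-ignored files count as considered) is an assumption, not something the
report states:

```sh
# Counts copied from the sample report above.
hashed=45
zero_size=2
unique_size=74
ignored=0

# Sum of the categories, to compare against "Considered".
echo $((hashed + zero_size + unique_size + ignored))   # prints 121
```
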
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"

$ ./dups . 1> /dev/null
Time                       : 0.070765 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
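
As a rough illustration of processing `stdout` with another tool, the sketch
below counts duplicate groups and redundant copies. It assumes the format shown
above: a `digest count` header line starting in the first column, followed by
indented path lines. `sample_output` and `summarize` are hypothetical helpers.
The sample text is copied from the example so the snippet runs without `dups`
installed:

```sh
# Stand-in for: ./dups . 2> /dev/null
sample_output() {
cat <<'EOF'
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"
EOF
}

# Header lines start with a hex digit and path lines start with whitespace,
# so only headers match. Each group of n files has n - 1 redundant copies.
summarize() {
    awk '/^[0-9a-f]/ {groups++; extra += $2 - 1} END {print groups, extra}'
}

sample_output | summarize   # prints 2 2
```

With a real run, replace `sample_output` with `./dups . 2> /dev/null`.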