dups
====

Find duplicate files in N given directory trees, where "duplicate" is defined
as having the same (and non-0) file size and MD5 hash digest.

It is roughly equivalent to the following one-liner (included as `dups.sh`):
```sh
find . -type f -print0 | xargs -0 -P $(nproc) -I % md5sum % | awk '{digest = $1; sub("^" $1 " +", ""); path = $0; paths[digest, ++cnt[digest]] = path} END {for (digest in cnt) {n = cnt[digest]; if (n > 1) {print(digest, n); for (i=1; i<=n; i++) {printf " %s\n", paths[digest, i]} } } }'
```

which, when indented, looks like:
```sh
find . -type f -print0 \
| xargs -0 -P $(nproc) md5sum \
| awk '
    {
        digest = $1
        sub("^" $1 " +", "")
        path = $0
        paths[digest, ++count[digest]] = path
    }

    END {
        for (digest in count) {
            n = count[digest]
            if (n > 1) {
                print(digest, n)
                for (i=1; i<=n; i++) {
                    printf " %s\n", paths[digest, i]
                }
            }
        }
    }'
```

and works well enough, but is painfully slow (for instance, it takes around 8
minutes to process my home directory, whereas `dups` takes around 8 seconds).

Originally, my main motivation for rewriting the above script in OCaml was
simply to avoid dealing with file paths containing newlines and spaces (the
original rewrite was substantially simpler than it currently is).

I have since realized that, on the _input_ side, the problem is avoided by
delimiting the found paths with the null byte rather than a newline, and by
doing an ostensible `shift` of the `$0` field in AWK (`sub("^" $1 " +", "")`).
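
As an illustration of that `shift` trick in isolation, here is a minimal
sketch (the digest and the space-containing file name are made-up examples):

```sh
# md5sum prints "<digest>  <path>"; stripping the first field from $0
# recovers the whole path, even when it contains spaces.
echo 'd41d8cd98f00b204e9800998ecf8427e  ./a b.txt' \
| awk '{ sub("^" $1 " +", ""); print "[" $0 "]" }'
# prints: [./a b.txt]
```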

However, on the _output_ side, I still don't know of a _simple_ way to escape
the newline in AWK (in OCaml there is `%S` in `printf`, and GNU `printf` has
`%q`).
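
For reference, this is roughly what that `printf` escape looks like; a minimal
sketch, using ANSI-C quoting to construct a made-up path with an embedded
newline:

```sh
# %q renders the argument in a form that can be reused as shell input.
printf '%q\n' $'./a\nb.txt'
# with Bash's built-in printf this prints: $'./a\nb.txt'
```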
In any case, I now have 2 other reasons to continue with this project:
1. The speed-up is a boon to my UX (thanks in large part to optimizations
   suggested by @Aeronotix);
2. I plan to extend the feature set, which is just too unpleasant to manage in
   a string-oriented PL:
    1. byte-by-byte comparison of files that hash to the same digest (see the
       sketch after this list), to make super-duper sure they are indeed the
       same and do not just happen to collide;
    2. extend the metrics reporting;
    3. output sorting options.
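
For that byte-by-byte check, the underlying idea can be sketched in shell with
`cmp` (the file pair below is just an illustrative example taken from the
output shown in the next section):

```sh
# Given two files reported under the same digest, compare their contents
# directly; `cmp -s` is silent and exits 0 only when the files are identical.
if cmp -s "./_build/dups.ml" "./dups.ml"; then
    echo "identical"
else
    echo "same digest, different bytes"
fi
```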

Example
-------
After building, run `dups` on the current directory tree:

```sh
$ make
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 5 targets (0 cached) in 00:00:00.

$ ./dups .
e40e3c4330857e2762d043427b499301 2
    "./_build/dups.native"
    "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
77 "./_build/dups.ml"
78 "./dups.ml"
Time                       : 0.031084 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
Note that the report lines are written to `stderr`, so that `stdout` is safely
processable by other tools:

```
$ ./dups . 2> /dev/null
e40e3c4330857e2762d043427b499301 2
93 "./_build/dups.native"
94 "./dups"
3d1c679e5621b8150f54d21f3ef6dcad 2
    "./_build/dups.ml"
    "./dups.ml"

$ ./dups . 1> /dev/null
Time                       : 0.070765 seconds
Considered                 : 121
Hashed                     : 45
Skipped due to 0 size      : 2
Skipped due to unique size : 74
Ignored due to regex match : 0

```
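
Since only the duplicate listing goes to `stdout`, it can be piped straight
into other tools; for instance, this (made-up) pipeline counts the reported
duplicate groups by keeping only the non-indented digest lines:

```sh
./dups . 2> /dev/null | grep -c '^[^ ]'
```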