
Why is there no standard Duplicate file finder in Puppy?

Posted: Sat 23 Sep 2017, 01:57
by Puppyt
THAT old chestnut. It rears its ugly head periodically in the forums, but things have been quiet lately, and my searches reveal no recent advances beyond the command-line offerings of fdupes, fslint, rdupes etc. There is a GUI for dupfinder that I downloaded via the Ubuntu repos with PPM (dupfinder 0.8), but I get an error with the 32-bit libraries on board:

Code: Select all

dupfgui: error while loading shared libraries: libtiff.so.4: wrong ELF class: ELFCLASS64   
I don't know how to address that problem - I'm chasing real-world deadlines again, and it's just so frustrating that there doesn't appear to be a straightforward solution in Puppy Linux.
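For what it's worth, a "wrong ELF class: ELFCLASS64" error usually means a 32-bit program is being handed a 64-bit copy of the library (or vice versa). A rough way to check, assuming the file utility is on board and that libtiff really sits in /usr/lib (adjust the path to wherever it actually lives):

Code: Select all

# Compare the ELF class of the binary with that of the library it complains about
file "$(which dupfgui)"        # expect "ELF 32-bit ..." for a 32-bit build
file /usr/lib/libtiff.so.4     # "ELF 64-bit ..." here would explain the mismatch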
Aha - finally got fdupes working; spaces in directory names were borking my commands. I'll see how that goes. jdupes is a newer fork on GitHub that claims to be 10x faster, but that is a battle I can't take on at this point in time.
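In case someone else trips over the same thing, quoting the path is all fdupes needs; the directory below is just a made-up example:

Code: Select all

# -r = recurse into subdirectories; the quotes protect the space in the name
fdupes -r "/mnt/home/My Documents"
# -S also prints the size of each duplicate set
fdupes -rS "/mnt/home/My Documents"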

I intend to revisit this topic periodically with more thorough results of my spring-cleaning attempts, but if anyone has a satisfactory duplicate file finder solution (preferably a GUI), please post your workarounds. I guess my main gripe is that it seems strange not to have a built-in application within Puppy. I wonder how difficult it would be to expand pFind's "advanced" functions into duplicate file finding?
Thanks in advance for your input.

Posted: Sat 23 Sep 2017, 11:10
by trapster
This will list dupe files in a console.
Run this in the directory you want to search:

Code: Select all

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
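Read stage by stage, the pipeline does roughly this (same command, just split up and commented for readability):

Code: Select all

# 1. print the byte size of every non-empty regular file
# 2. sort the sizes and keep only those that occur more than once
# 3. for each duplicated size, run find again to list the files of that size
# 4. md5sum those candidates, then group lines whose first 32 chars (the hash) repeat
find -not -empty -type f -printf "%s\n" |
  sort -rn | uniq -d |
  xargs -I{} -n1 find -type f -size {}c -print0 |
  xargs -0 md5sum |
  sort | uniq -w32 --all-repeated=separate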

Posted: Sun 24 Sep 2017, 02:57
by MochiMoppel
trapster wrote:Run this in the directory you want to search
...and go out for lunch :lol:
At least on my system this is VERY slow, and I wonder why this monster was overwhelmingly voted the best solution on commandlinefu.com.
The claim that it "saves time by comparing size first, then md5sum" sounds convincing but unfortunately isn't true. My result for directory /usr/share:
real 2m56.738s
user 0m33.748s
sys 2m22.264s

The No.3 on the list is much shorter and faster. Slightly modified for comparability:

Code: Select all

find -not -empty -type f -exec md5sum '{}' ';' | sort | uniq -w32 --all-repeated=separate
real 0m24.203s
user 0m0.767s
sys 0m2.293s

That was much better, but still not good. This would be really fast:

Code: Select all

find  -not -empty -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
real 0m1.235s
user 0m0.627s
sys 0m0.737s
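If anyone wants to reproduce these figures, the real/user/sys lines look like plain bash time output; something along these lines should do it (using /usr/share as the test directory, as above):

Code: Select all

cd /usr/share
# bash's time keyword measures the whole pipeline; the listing is discarded so only the timing prints
time find -not -empty -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate > /dev/null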


Posted: Sun 24 Sep 2017, 02:58
by Puppyt
Thanks trapster :) Still on the hunt for a GUI of some sort for assessing which backups/duplicates to delete - dupfgui might be the best of a (?) Linux bunch.
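Not a GUI, but for the "which copy do I delete" part, fdupes itself has an interactive mode that prompts per duplicate set - a rough example, with a placeholder path:

Code: Select all

# First pass: just list the duplicate sets into a file for review
fdupes -r "/mnt/home/backups" > dupes.txt
# Interactive pass: -d prompts for which file(s) to keep in each set and deletes the rest
fdupes -rd "/mnt/home/backups"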

UPDATE: Thanks for the suggested tweaks to the code, MochiMoppel - we posted simultaneously, it seems...

Dupfinder installed on tahrpup 5.8.3

Posted: Wed 25 Jul 2018, 05:15
by hamoudoudou
Dupfinder installed on tahrpup 5.8.3 (home-made Puplet) to check files stored on USB.

Posted: Wed 17 Jun 2020, 13:06
by MochiMoppel
I wrote:That was much better, but still not good. This would be really fast:

Code: Select all

find  -not -empty -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
I hate to quote myself, but in this case I have to because I was wrong. My code is fast when there are many relatively small files to compare, as in /root or /usr/share. However, when scanning directories with relatively large files, like audio or image files, my approach of hashing every file is much too slow. In such cases it makes sense to search for files with the same size first and then - in a second step - check only those same-size files for duplicates.

My problem with the code posted by trapster is its inefficiency: it uses the find command multiple times, first to find the duplicated sizes and then, once per size, again to find the matching files.

After some tinkering I found a way to use find only once. This makes the code almost 10 times faster than the trapster/commandlinefu.com code:

Code: Select all

find  -not -empty -type f -printf '%12s\t%p\n' | sort -n | uniq -Dw12 | cut -f2- | xargs  -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate
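Stage by stage, the same pipeline split up and commented (my reading of it):

Code: Select all

# 1. print "size<TAB>path" for every non-empty file, the size padded to 12 digits
# 2. sort by size; uniq -Dw12 keeps every line whose first 12 chars (the size) repeat
# 3. cut away the size column, leaving only the paths
# 4. md5sum only those same-size candidates, then group identical hashes as before
find -not -empty -type f -printf '%12s\t%p\n' |
  sort -n | uniq -Dw12 |
  cut -f2- |
  xargs -d '\n' md5sum |
  sort | uniq -w32 --all-repeated=separate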

Posted: Thu 18 Jun 2020, 00:27
by Puppyt
Thanks MochiMoppel :) Much appreciated!