Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Paul <nospam@needed.invalid>
Newsgroups: comp.lang.c
Subject: Re: program to remove duplicates
Date: Sun, 22 Sep 2024 03:29:08 -0400
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <vcoh04$24ioi$1@dont-email.me>
References: <ecb505e80df00f96c99d813c534177115f3d2b15@i2pn2.org>
 <vcnfbi$1ocq6$1@dont-email.me>
 <8630bec343aec589a6cdc42bb19dae28120ceabf@i2pn2.org>
 <vcnu3p$1vkui$2@dont-email.me> <66EF8293.30803@grunge.pl>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 22 Sep 2024 09:29:09 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d721958c9a184a47dffe671c8102b6da";
	logging-data="2247442"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX180RsHa4YFdPTeKO2zB+EL/uL9giPp9Avw="
User-Agent: Ratcatcher/2.0.0.25 (Windows/20130802)
Cancel-Lock: sha1:9cKhIJm2WBqaJQHQE8AGTU/nGEc=
Content-Language: en-US
In-Reply-To: <66EF8293.30803@grunge.pl>
Bytes: 3770

On Sat, 9/21/2024 10:36 PM, fir wrote:
> Lawrence D'Oliveiro wrote:
>> On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:
>>
>>> ... you just need to read all files in
>>> folder and compare it byte by byte to other files in folder of the same
>>> size
>>
>> For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
>> That’s an O(N²) algorithm.
>>
>> There is a faster way.
>>
> not quite, as most files have different sizes, so most binary comparisons
> are discarded because the file sizes differ (and those sizes i read linearly when building the list of filenames)
> 
> what i posted seems to work ok; it doesn't work fast, but it's hard to say whether it can be optimised or whether it takes as long as it should
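
fir's size filter can be sketched in C as a byte-by-byte comparison that
rejects files of differing size before reading any bytes. This is a minimal
sketch with made-up helper names, not fir's posted code:

```c
#include <stdio.h>

/* Return the size of a file in bytes, or -1 on error. */
static long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long n;

    if (f == NULL)
        return -1;
    if (fseek(f, 0L, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    n = ftell(f);
    fclose(f);
    return n;
}

/* Return 1 if the two files have identical contents, 0 otherwise.
   Files of differing size are rejected without reading their bytes. */
static int files_equal(const char *a, const char *b)
{
    FILE *fa, *fb;
    int ca, cb;

    if (file_size(a) != file_size(b))
        return 0;                    /* the cheap size filter */

    fa = fopen(a, "rb");
    fb = fopen(b, "rb");
    if (fa == NULL || fb == NULL) {
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return 0;
    }
    do {
        ca = getc(fa);
        cb = getc(fb);
    } while (ca == cb && ca != EOF);
    fclose(fa);
    fclose(fb);
    return ca == cb;                 /* both hit EOF on a full match */
}
```

Since most same-size pairs differ early, the loop usually terminates after
a few reads even when the size filter does not help.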

The normal way to do this is to hash the files and compare the
hashes. You can use MD5SUM, SHA1SUM, or SHA256SUM as a means to
compare two files. If you want to be picky about it, stick with
SHA256SUM.
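
In C, hash-then-compare can be sketched with a stand-in hash. FNV-1a below
is only an illustration (it is not cryptographic, unlike MD5 or SHA-256),
and the function names are made up for this sketch:

```c
#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a over a buffer -- a toy stand-in for MD5/SHA-256. */
static uint64_t fnv1a_buf(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Hash a whole file; returns 0 and sets *out on success, -1 on error. */
static int fnv1a_file(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    unsigned char buf[4096];
    uint64_t h = 0xcbf29ce484222325ULL;
    size_t n, i;

    if (f == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ULL;
        }
    }
    fclose(f);
    *out = h;
    return 0;
}
```

Two files are duplicate candidates when their hashes match; with a weak
hash like this, a final byte-by-byte compare is needed to confirm, while a
SHA-256 match is conclusive in practice.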

   hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt                 # Took about two minutes to run this on an SSD.
                                                             # For a hard drive, use -j 1 . For an SSD, use a higher thread count for -j .

Size   MD5SUM                             Path

Same size, same hash value. The size is zero, so the MD5SUM in this case is always the same (the MD5 digest of empty input).

0,     d41d8cd98f00b204e9800998ecf8427e,  H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
0,     d41d8cd98f00b204e9800998ecf8427e,  H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock

Same size, different hash value. These are not the same file.

65536, a8113cfdf0227ddf1c25367ecccc894b,  H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
65536, 5e91acf90e90be408b6549e11865009d,  H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin

You can use the "sort" command to sort by the first and second fields if you want.
Sorting the output lines places files with identical size and hash next to one another in the output.
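
The same grouping can be done in C with qsort: order the records by size,
then by hash, and duplicates end up adjacent. The record layout below is
made up for illustration (hashdeep's actual output format differs):

```c
#include <stdlib.h>
#include <string.h>

struct rec {
    long size;           /* file size in bytes */
    char hash[33];       /* hex MD5, NUL-ended */
    char path[260];      /* file path          */
};

/* Order by size first, then by hash, so duplicates become adjacent. */
static int cmp_rec(const void *a, const void *b)
{
    const struct rec *ra = a, *rb = b;

    if (ra->size != rb->size)
        return (ra->size > rb->size) - (ra->size < rb->size);
    return strcmp(ra->hash, rb->hash);
}

/* Sort recs[0..n-1] in place; afterwards, scan adjacent entries for
   equal (size, hash) pairs to report duplicate sets. */
static void sort_recs(struct rec *recs, size_t n)
{
    qsort(recs, n, sizeof recs[0], cmp_rec);
}
```

After the sort, one linear pass over the array finds every duplicate set,
avoiding the O(N²) all-pairs comparison.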

The output of data recovery software is full of "fragments". Using
the "file" command (a Linux command; a Windows port is available)
allows ignoring files which have no value (listed as "data").
Recognizable files will be listed as "PNG" or "JPEG" and so on.

A utility such as PhotoRec can attempt to glue files back together. Your mileage may vary.
That is a scan-based file recovery method. I have not used it.

https://en.wikipedia.org/wiki/PhotoRec

   Paul