Re: program to remove duplicates

Subject: Re: program to remove duplicates
From: fir (at) *nospam* grunge.pl (fir)
Newsgroups: comp.lang.c
Date: 22 Sep 2024, 11:24:06
Organization: i2pn2 (i2pn.org)
Message-ID: <66EFF046.8010709@grunge.pl>
References: 1 2 3 4 5 6
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:27.0) Gecko/20100101 Firefox/27.0 SeaMonkey/2.24
Paul wrote:
On Sat, 9/21/2024 10:36 PM, fir wrote:
Lawrence D'Oliveiro wrote:
On Sun, 22 Sep 2024 00:18:09 +0200, fir wrote:
>
... you just need to read all the files in the
folder and compare each one byte by byte to the other files in the folder of
the same size
>
For N files, that requires N × (N - 1) ÷ 2 byte-by-byte comparisons.
That’s an O(N²) algorithm.
>
There is a faster way.
>
not quite, as most files have different sizes, so most binary comparisons
are discarded because the sizes differ (and I read those sizes linearly when building the list of filenames)
>
what I posted seems to work ok; it doesn't run fast, but it's hard to say whether it can be optimised or whether it simply takes as long as it should
>
The normal way to do this is to do a hash check on the
files and compare the hashes. You can use MD5SUM, SHA1SUM, or SHA256SUM
as a means to compare two files. If you want to be picky about
it, stick with SHA256SUM.
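
(For illustration, a minimal sketch of what such a per-file hash computation looks like in C - this assumes OpenSSL's EVP API is available, which is an assumption, not something from the tools above; comparing two files then reduces to comparing their 32-byte digests:)

#include <openssl/evp.h>
#include <stdio.h>

/* stream one file through SHA-256 and return the 32-byte digest in out[] */
int FileSha256(const char *path, unsigned char out[32])
{
   FILE *f = fopen(path, "rb");
   if (!f) return 0;

   EVP_MD_CTX *ctx = EVP_MD_CTX_new();
   EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

   unsigned char buf[64 * 1024];
   size_t n;
   while ((n = fread(buf, 1, sizeof buf, f)) > 0)
      EVP_DigestUpdate(ctx, buf, n);

   unsigned int len = 0;
   EVP_DigestFinal_ex(ctx, out, &len);       /* len comes back as 32 for SHA-256 */
   EVP_MD_CTX_free(ctx);
   fclose(f);
   return 1;
}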
>
    hashdeep64 -c MD5 -j 1 -r H: > H_sums.txt                 # Took about two minutes to run this on an SSD.
                                                              # For a hard drive, use -j 1; for an SSD, use a higher thread count for -j.
>
Size   MD5SUM                             Path
>
Same size, same hash value. The size is zero. The MD5 in this case is always the same (the MD5 digest of zero-length input).
>
0,     d41d8cd98f00b204e9800998ecf8427e,  H:\Users\Bullwinkle\AppData\Local\.IdentityService\AadConfigurations\AadConfiguration.lock
0,     d41d8cd98f00b204e9800998ecf8427e,  H:\Users\Bullwinkle\AppData\Local\.IdentityService\V2AccountStore.lock
>
Same size, different hash value. These are not the same file.
>
65536, a8113cfdf0227ddf1c25367ecccc894b,  H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\5213954f4433d4fbe45ed37ffc67d43fc43b54584bfd3a8d.bin
65536, 5e91acf90e90be408b6549e11865009d,  H:\Users\Bullwinkle\AppData\Local\AMD\DxCache\bf7b3ea78a361dc533a9344051255c035491d960f2bc7f31.bin
>
You can use the "sort" command to sort by the first and second fields if you want.
Sorting the output lines places the identical files next to one another in the output.
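
(The same grouping can also be done in the program itself - a sketch, assuming the size/MD5/path fields of each line have already been parsed into an array; the Entry struct and its field widths are made up for the example:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* one parsed line of the hash listing: size, md5 (hex), path */
typedef struct { long long size; char md5[33]; char path[512]; } Entry;

/* order by size first, then by hash, so duplicates end up adjacent */
static int by_size_then_hash(const void *a, const void *b)
{
   const Entry *x = a, *y = b;
   if (x->size != y->size) return (x->size < y->size) ? -1 : 1;
   return strcmp(x->md5, y->md5);
}

/* after sorting, any run of entries with equal (size, md5) is a duplicate group */
void ReportDuplicates(Entry *e, size_t n)
{
   qsort(e, n, sizeof *e, by_size_then_hash);
   for (size_t i = 1; i < n; i++)
      if (e[i].size == e[i-1].size && strcmp(e[i].md5, e[i-1].md5) == 0)
         printf("duplicate: %s <-> %s\n", e[i-1].path, e[i].path);
}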
>
The output of data recovery software is full of "fragments". Using
the "file" command (a Linux command; a Windows port is available)
allows ignoring files which have no value (listed as "Data").
Recognizable files will be listed as "PNG" or "JPG" and so on.
>
A utility such as PhotoRec can attempt to glue files back together. Your mileage may vary.
That is a scan-based file recovery method. I have not used it.
>
https://en.wikipedia.org/wiki/PhotoRec
>
    Paul
>
I do not do recovery - my program removes duplicates.
I mean, programs such as Recuva, when they recover files, recover tens of thousands of files and gigabytes of data with lost names and some common types (.mp3, .jpg, .txt and so on), and many of those files are binary duplicates.
The code I posted last just finds files that are duplicates and moves them to a subdirectory 'duplicates', and it can show that half of those files or more (many gigabytes) are pure duplicates, so one may then delete that subfolder and recover the space.
The code I posted works ok, and if someone has Windows and MinGW/TDM they may compile it and check the application if they want.
Hashing is not necessary imo, though it probably could speed things up - I am not strongly convinced that the probability of a mistake with this hashing is strictly zero (I have never used it and would probably need to write my own hashing). It is probably mathematically proven to be almost zero, but for now it is more interesting to me whether the code I posted is ok.
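
(One way to get the speedup without trusting the hash for correctness - a minimal sketch, not part of the posted program: use a cheap non-cryptographic hash such as FNV-1a purely as a pre-filter, and let the existing byte-by-byte comparison still make the final decision, so a hash collision can never cause a wrong removal.)

#include <stdint.h>
#include <stddef.h>

/* FNV-1a: simple non-cryptographic hash, used here only to reject
   obvious non-duplicates cheaply; equality is still confirmed byte by byte */
uint64_t fnv1a(const unsigned char *buf, size_t len)
{
   uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
   for (size_t i = 0; i < len; i++)
   {
      h ^= buf[i];
      h *= 1099511628211ULL;               /* FNV prime */
   }
   return h;
}

/* hypothetical use inside the square loop (names as in the posted code):
     if (size_a != size_b) continue;                          // existing size pre-filter
     if (fnv_a != fnv_b) continue;                            // cheap reject, never wrong
     if (CompareTwoFilesByContentsAndSayIfEqual(a, b)) ...    // final word               */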
You may see the main procedure of it.
First it builds a list of files with sizes,
using the Windows WinAPI function
   HANDLE h = FindFirstFile(dir, &ffd);
(that is linear, say 12k calls for 12k files in the folder)
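
(Roughly what that enumeration step looks like - a sketch, not the posted code; the "*" pattern and the printf are placeholders for however the real program stores its list:)

#include <windows.h>
#include <stdio.h>

/* enumerate all files in the current folder and print name + size;
   the real program would append each entry to its own list instead */
void ListFilesWithSizes(void)
{
   WIN32_FIND_DATAA ffd;
   HANDLE h = FindFirstFileA("*", &ffd);      /* "*" = everything in this folder */
   if (h == INVALID_HANDLE_VALUE) return;

   do
   {
      if (ffd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) continue;   /* skip subfolders */
      unsigned long long size =
         ((unsigned long long)ffd.nFileSizeHigh << 32) | ffd.nFileSizeLow;
      printf("%s %llu\n", ffd.cFileName, size);
   } while (FindNextFileA(h, &ffd));

   FindClose(h);
}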
Then it runs a square loop (on the order of 12k * 12k / 2 iterations)
and compares byte by byte those that have the same size:
  /* needs <stdio.h>, <stdlib.h>, <sys/stat.h> */

  int GetFileSize2(char *filename)
  {
     struct stat st;
     if (stat(filename, &st) == 0) return (int) st.st_size;   /* note: int truncates files > 2 GB */
     printf("\n *** error obtaining file size for %s", filename); exit(-1);
     return -1;
  }

  /* bytes1_resize()/bytes1/bytes1_size (and the bytes2_* twins) are the
     poster's global buffer helpers, not shown in this message */
  void bytes1_load(char* name)
  {
     int flen = GetFileSize2(name);
     FILE *f = fopen(name, "rb");
     if (!f) { printf("error: cannot open file %s for load ", name); exit(-1); }
     int loaded = fread(bytes1_resize(flen), 1, flen, f);
     if (loaded != flen) { printf("error: short read on %s ", name); exit(-1); }
     fclose(f);
  }
int CompareTwoFilesByContentsAndSayIfEqual(char* file_a, char* file_b)
{
    bytes1_load(file_a);
    bytes2_load(file_b);
    if (bytes1_size != bytes2_size) { printf("\n something is wrong, compared files were assumed to be the same size"); exit(-1); }
    for (unsigned int i = 0; i < bytes1_size; i++)
      if (bytes1[i] != bytes2[i]) return 0;
    return 1;
}
This has two elements: the file load into RAM and then the comparison.
(Reading the size again is redundant, as I already have that info from the FindFirstFile(dir, &ffd) WinAPI call, but maybe just to be sure I
read it also from this stat() function again.)
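
(An alternative that avoids loading both whole files into RAM - a sketch, not the posted code; the 64 KB chunk size is an arbitrary choice. It also stops at the first differing chunk:)

#include <stdio.h>
#include <string.h>

/* compare two files of equal size in 64 KB chunks; returns 1 if identical */
int CompareTwoFilesChunked(const char *file_a, const char *file_b)
{
   enum { CHUNK = 64 * 1024 };
   static unsigned char a[CHUNK], b[CHUNK];
   int equal = 1;

   FILE *fa = fopen(file_a, "rb");
   FILE *fb = fopen(file_b, "rb");
   if (!fa || !fb) { if (fa) fclose(fa); if (fb) fclose(fb); return 0; }

   for (;;)
   {
      size_t na = fread(a, 1, CHUNK, fa);
      size_t nb = fread(b, 1, CHUNK, fb);
      if (na != nb || memcmp(a, b, na) != 0) { equal = 0; break; }   /* first difference: stop */
      if (na < CHUNK) break;                                         /* end of both files      */
   }

   fclose(fa);
   fclose(fb);
   return equal;
}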
And then finally there is a linear pass to move the entries on the list that are marked as duplicates to a subfolder:
  /* needs <stdio.h>, <stdlib.h>, <sys/stat.h>, <direct.h> */

  int FolderExist(char *name)
  {
     static struct stat st;
     if (stat(name, &st) == 0 && S_ISDIR(st.st_mode)) return 1;
     return 0;
  }

int duplicates_moved = 0;

void MoveDuplicateToSubdirectory(char* name)
{
    if (!FolderExist("duplicates"))
    {
      int n = _mkdir("duplicates");
      if (n) { printf("\n i cannot create subfolder"); exit(-1); }
    }
    static char renamed[1000];
    int n = snprintf(renamed, sizeof(renamed), "duplicates\\%s", name);   /* "\\": the backslash must be escaped in a C string literal */
    if (n < 0 || n >= (int)sizeof(renamed)) { printf("\n path too long: %s", name); exit(-1); }
    if (rename(name, renamed))
    { printf("\n rename %s %s failed", name, renamed); exit(-1); }
    duplicates_moved++;
}
I'm not sure whether some of these functions are slow, and there is an element of redundancy in calling if(!FolderExist("duplicates"))
many times, as if it were a normal "RAM-based" rather than disk-related function
- but it's probably okay, I guess (and I hope this disk-related function does not really hit the disk but only reads some cached metadata about it)
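
(If that repeated stat() ever turns out to matter, one cheap option - a sketch, not the posted code - is to remember the result in a static flag after the first call:)

#include <direct.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* create the "duplicates" folder at most once per run; later calls only
   test a static flag instead of calling stat() on the disk every time */
void EnsureDuplicatesFolder(void)
{
   static int created = 0;
   if (created) return;

   if (_mkdir("duplicates") != 0 && errno != EEXIST)
   { printf("\n i cannot create subfolder"); exit(-1); }

   created = 1;
}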
