Feature Request: Save hash to Report DB. #25

Open

rosyth opened this issue May 8, 2020 · 9 comments

rosyth commented May 8, 2020

Since hashing is already being done, why not save the hash to the report database? This would allow me to merge, by hash, two separate dupd runs on different external drives.

I can import the SQLite DBs into Python/pandas (since I'm not familiar with SQL), merge them, and get a new list of possible duplicates, e.g.:
```python
import pandas as pd
import sqlite3

con = sqlite3.connect("dupd.db3")
dupx = pd.read_sql('SELECT * FROM duplicates WHERE each_size > 10000;', con)
```
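
To make the intent concrete, here is a rough sketch of the merge I have in mind, assuming the report database had the requested `hash` column (the filenames here are just examples):

```python
# Rough sketch of the intended cross-drive merge. Assumes the report db
# had the requested `hash` column; the filenames below are examples only.
import sqlite3

import pandas as pd

def load_report(db_path):
    con = sqlite3.connect(db_path)
    try:
        return pd.read_sql("SELECT each_size, paths, hash FROM duplicates;", con)
    finally:
        con.close()

drive_a = load_report("drive_a.db3")
drive_b = load_report("drive_b.db3")

# Rows sharing a hash across the two reports are duplicate candidates
# spanning both drives.
cross = drive_a.merge(drive_b, on="hash", suffixes=("_a", "_b"))
print(cross[["hash", "each_size_a", "paths_a", "paths_b"]])
```
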
I've tried to modify the code myself to add the hashes, but not having used C for 20 years, it hasn't been very successful.

I suspect it would not be difficult, and it could be quite useful to other users too.

rosyth (Author) commented May 9, 2020

Well, eventually I've done it, with something of a hack job.
It would be nicer if it were done properly by someone who actually knows what they're doing.
This is based on the latest release, 2.0-dev, where `dupd_latest/dupd` is the release version and `dupd` is the modified one.

```
diff -rw dupd_latest/dupd dupd --exclude=tests --exclude=*.git
Only in dupd: build
Only in dupd: dupd
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.c dupd/src/dbops.c
112c112
<                         "each_size INTEGER, paths TEXT)");
---
>                         "each_size INTEGER, paths TEXT, hash TEXT )");
420c420
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths)
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash)
422c422,425
<   const char * sql = "INSERT INTO duplicates (count, each_size, paths) "
---
> 
>   const char * sqly = "INSERT INTO duplicates (count, each_size, paths, hash) "
>                      "VALUES(?, ?, ?, ?)";
>   const char * sqlx = "INSERT INTO duplicates (count, each_size, paths) "
424a428,429
>   int hash_len = strlen(hash);
>   const char * sql = ( hash == 0 ? sqlx : sqly );
440a446,451
> 
>   if( hash != 0 ) {
>     // printf("++++++++++++++ Hash %d -> %s\n", hash_len, hash);
>     rv = sqlite3_bind_text(stmt_duplicate_to_db, 4, hash, -1, SQLITE_STATIC);
>     rvchk(rv, SQLITE_OK, "Can't bind file hash: %s\n", dbh);
>   }
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/dbops.h dupd/src/dbops.h
135c135
< void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths);
---
> void duplicate_to_db(sqlite3 * dbh, int count, uint64_t size, char * paths, char * hash);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/filecompare.c dupd/src/filecompare.c
76c76
<   duplicate_to_db(dbh, 2, size, paths);
---
>   duplicate_to_db(dbh, 2, size, paths, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/hashlist.c dupd/src/hashlist.c
326a327,332
>   char hash_out[HASH_MAX_BUFSIZE];
>   char * strhash;
>   char * strp ;
>   char * hashp = hash_out;
>   int hsize = hash_get_bufsize(hash_function);
>   
372,373d377
<           int hsize = hash_get_bufsize(hash_function);
<           char hash_out[HASH_MAX_BUFSIZE];
382a387
>         strp = memstring("hash", p->hash, hsize);
389,390c394,395
<       duplicate_to_db(dbh, p->next_index, size, pbi->buf);
< 
---
>       duplicate_to_db(dbh, p->next_index, size, pbi->buf, strp);
>       free(strp);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/refresh.c dupd/src/refresh.c
132c132
<         duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list);
---
>         duplicate_to_db(dbh, new_entry_count, entry_each_size, new_list, 0);
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.c dupd/src/utils.c
300a301
> 
307c308
<     printf("%s: ", text);
---
>     printf("%s: %d: ", text, bytes);
314a316,332
> }
> 
> char * memstring(char * text, char * ptr, int bytes)
> {
>   int i;
>   unsigned char * p = (unsigned char *)ptr;
>   int space = ( strlen(ptr)*3 + 2 );
>   char * optr = (char *) malloc((1024) * sizeof(char));
>   char * xptr = optr ;
> 
>   for (i=0; i<bytes; i++) {
>     xptr += sprintf(xptr, "%02x ", *p++);
>   }
>   //printf("\n-----------> memstring >> %s <-------------\n", optr);
>   //memdump(text, ptr, bytes);
>   //printf("~~~~~~~~~~~~\n");
>   return optr;
diff -rw '--exclude=tests' '--exclude=*.git' dupd_latest/dupd/src/utils.h dupd/src/utils.h
239a240,241
> char * memstring(char * text, char * ptr, int bytes);
>
```

So, not much has changed, but then I'm not sure about the hidden side effects (if any).

jvirkki (Owner) commented May 11, 2020

Thanks for using dupd!

Saving the hashes of duplicates is easy enough, but I'm not sure it would be useful.

Hashes are computed only for files known to be duplicates (if a file can be rejected earlier, the full file is not read, so the hash isn't computed).

If you compare the known-duplicate hashes from two different systems, there is no guarantee that this will find any duplicates even if they exist. That's because files which are duplicates across the two systems won't have a hash present unless each of them also has duplicates on its own local system. So comparing across systems that way will only match a somewhat random subset of files, if any.

(If the two external drives are mounted on the same system, run dupd with multiple -p options pointing at both paths, which will solve that use case.)

In general, doing a duplicate find across separate systems requires computing hashes for all files. That's easy enough with just find and sha1sum, but it will be very slow.
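
For illustration, a minimal sketch of that brute-force "hash everything" approach in Python (not something dupd does; stdlib only, purely an example):

```python
# Minimal sketch of the "hash everything" approach for cross-system
# comparison (not part of dupd). Writes "size<TAB>sha1<TAB>path" lines,
# which can later be joined on (size, hash) across systems.
import hashlib
import os
import sys

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def walk_and_hash(root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                print(f"{os.path.getsize(path)}\t{sha1_of_file(path)}\t{path}")
            except OSError as err:
                print(f"skipping {path}: {err}", file=sys.stderr)

if __name__ == "__main__":
    walk_and_hash(sys.argv[1] if len(sys.argv) > 1 else ".")
```
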

@priyadarshan commented:

If I may butt in, one use case for storing hashes of all files: checking for duplicates on completely separate systems, especially with completely different paths, with the intent of keeping certain subsets on chosen machines (i.e., keeping some parts duplicated and others not).

Admittedly, this is an uncommon use case one would not expect dupd to solve. Still, it is one the non-profit I volunteer for has been facing for some time.

rosyth (Author) commented May 11, 2020

Hi, yes, of course you are correct that comparing individual dupd runs by hash will only catch duplicates present on both drives. However, I was also considering creating a separate file list with xxHash output to compare against the original (and to pump into pandas as well). As you say, something like
`find . -type f -printf "%s:%p:" -exec xxh64sum {} \; > filelist.xxh64`
will do the job; there's no need for cryptographic hashes for this kind of task.
But you are probably right that my use case is a bit muddled. That comes from an accumulation of several backup drives and a few disk failures over the last five years that I did nothing with and now want to reorganise (lockdown :-)). Still, apart from a little extra file-space overhead, it doesn't do any harm to save the hash too.
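
For what it's worth, a rough sketch of how I'd pull that file list into pandas afterwards. It assumes each line ends up as `size:path:hash  path` (i.e., that xxh64sum prints the hash as the first token of its output) and that paths contain no ':'; adjust the parsing if your output differs:

```python
# Rough sketch: parse "size:path:hash  path" lines produced by the
# find/xxh64sum pipeline above. Assumes xxh64sum prints the hash as the
# first token of its output and that paths contain no ':' characters.
import pandas as pd

rows = []
with open("filelist.xxh64") as f:
    for line in f:
        size, path, rest = line.rstrip("\n").split(":", 2)
        rows.append((int(size), path, rest.split()[0]))

files = pd.DataFrame(rows, columns=["each_size", "path", "hash"])

# Files sharing a (size, hash) pair are candidate duplicates.
dups = files[files.duplicated(subset=["each_size", "hash"], keep=False)]
print(dups.sort_values(["each_size", "hash"]))
```
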


And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.

jvirkki (Owner) commented May 11, 2020

Bit of trivia: dupd is named like a daemon (it ends in 'd') even though it is not, because during the initial implementation my plan was for it to be a daemon that coordinates duplicate finding across systems. That turned out to be too slow to be interesting, so I focused on the local-disk case but didn't change the name.

I'd still love to solve the multiple-systems problem if there is an efficient way that is much better than simply using `find | sort | uniq`.

@rosyth - dupd currently does save the hashes of some files, but only large ones. You could get these from the .dupd_cache db with something like:

`select files.path,hash from files,hashes where hashes.id = files.id;`

There's a performance cost to saving these, though, so they're only saved for large files.
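
For example, in the same pandas style as the snippet above, something like this would pull them out (the cache db location below is a guess; point it at wherever your .dupd_cache actually lives):

```python
# Sketch: read the cached large-file hashes using the query above.
# The cache db path is a guess; adjust it to your actual .dupd_cache file.
import os
import sqlite3

import pandas as pd

con = sqlite3.connect(os.path.expanduser("~/.dupd_cache"))  # assumed location
cached = pd.read_sql(
    "SELECT files.path, hash FROM files, hashes WHERE hashes.id = files.id;",
    con,
)
print(cached.head())
```
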

jvirkki (Owner) commented May 11, 2020

> And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.

The manpage covers this, but if there's anything the manpage doesn't make clear, please let me know so I can add more clarity.

@priyadarshan commented:

> I'd still love to solve the multiple-systems problem if there is an efficient way that is much better than simply using `find | sort | uniq`.

This is inspiring to hear, as it is the same direction I was heading in.

Would it be fine to open a new ticket for your consideration, presenting our use case, or shall I clarify here?

jvirkki (Owner) commented May 12, 2020

Feel free to file another ticket with specific use case details.

I'm not entirely convinced it's possible, though. Trying to coordinate partial file matches over the network (particularly if more than two systems are involved) would likely introduce so much delay that it's simply faster to hash everything and compare later. At that point dupd doesn't add any value, since that can be done with a trivial shell script. But I'd love to be proved wrong.

rosyth (Author) commented May 12, 2020

> And thanks for the clarification; it wasn't quite clear to me that the --path switch was cumulative.
>
> The manpage covers this, but if there's anything the manpage doesn't make clear, please let me know so I can add more clarity.

Yes, I see that now, thanks. RTFM always applies.
