Derive the data deduplication rate of a file, implemented in C.
- Build with
make
or with the command below:
gcc program.c -o program -L/usr/local/opt/[email protected]/lib -I/usr/local/opt/[email protected]/include -lcrypto -lz -lm
- Run the program with a chunk size, a hash table size, and one or more files.
Modify the argv[] handling in the main function to enable multiple input files.
./program chunksize(KB) hashtablesize file1 file2 ...
Example with a single file. (Note: if the target is a folder, tar it first.)
./program 8 1000000 linux-5.3
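
A minimal sketch of how the command line above could be parsed, assuming chunksize is given in KB and converted to bytes before reading. The struct `dedup_args` and the functions `parse_positive` and `parse_args` are hypothetical names used for illustration; they are not taken from the program's source.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the parsed command line. */
struct dedup_args {
    size_t chunk_bytes;    /* chunk size, converted from KB to bytes */
    size_t table_entries;  /* number of hash table entries */
    char **files;          /* points at the first file argument */
    int file_count;        /* number of file arguments */
};

/* Parse a strictly positive decimal integer.
 * Returns 0 on success, -1 on malformed, zero, or out-of-range input. */
static int parse_positive(const char *s, size_t *out)
{
    char *end = NULL;
    unsigned long long v;

    /* Reject signs and leading whitespace, which strtoull would accept. */
    if (*s < '0' || *s > '9')
        return -1;
    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > SIZE_MAX)
        return -1;
    *out = (size_t)v;
    return 0;
}

/* Expected layout: program chunksize(KB) hashtablesize file1 [file2 ...]
 * Returns 0 on success, -1 if the arguments are missing or invalid. */
int parse_args(int argc, char **argv, struct dedup_args *out)
{
    size_t chunk_kb;

    if (argc < 4)
        return -1;
    if (parse_positive(argv[1], &chunk_kb) != 0 ||
        parse_positive(argv[2], &out->table_entries) != 0)
        return -1;
    if (chunk_kb > SIZE_MAX / 1024)  /* guard the KB-to-bytes multiplication */
        return -1;
    out->chunk_bytes = chunk_kb * 1024;
    out->files = &argv[3];
    out->file_count = argc - 3;
    return 0;
}
```

With the arguments `8 1000000 file1`, this sketch yields a chunk size of 8192 bytes, 1000000 table entries, and one input file.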
- Build a hash table with hashtablesize entries
- Read the file(s) as a byte stream, chunksize KB at a time
- Generate a SHA-1 fingerprint for each chunk
- Take the 32-bit prefix of each 160-bit fingerprint as its hashcode
- Insert the hashcode into the hash table and store the full fingerprint in a linked list to avoid false positives
- When a collision occurs, traverse the linked list and compare the stored fingerprints
- Derive the dedup rate:
1 - (unique_chunks/total_chunks)
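
The steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the program's actual code: the names `struct node`, `struct table`, `fp_prefix`, `table_init`, `table_add`, `dedup_rate`, and `table_free` are hypothetical, and mapping the 32-bit hashcode to a bucket by modulo is an assumption. Fingerprints are taken as ready-made 20-byte arrays; in the real program they would come from a SHA-1 routine such as OpenSSL's `SHA1()`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FP_LEN 20  /* SHA-1 digest length in bytes (160 bits) */

/* Chain node: keeps the full fingerprint so that chunks sharing a
 * bucket can still be told apart. */
struct node {
    unsigned char fp[FP_LEN];
    struct node *next;
};

struct table {
    struct node **buckets;
    size_t size;           /* hashtablesize */
    size_t total_chunks;
    size_t unique_chunks;
};

/* Hashcode: the first 32 bits of the fingerprint, read big-endian. */
static uint32_t fp_prefix(const unsigned char fp[FP_LEN])
{
    return ((uint32_t)fp[0] << 24) | ((uint32_t)fp[1] << 16) |
           ((uint32_t)fp[2] << 8)  |  (uint32_t)fp[3];
}

/* Returns 0 on success, -1 on a zero size or allocation failure. */
int table_init(struct table *t, size_t size)
{
    if (size == 0)
        return -1;
    t->buckets = calloc(size, sizeof *t->buckets);
    if (t->buckets == NULL)
        return -1;
    t->size = size;
    t->total_chunks = 0;
    t->unique_chunks = 0;
    return 0;
}

/* Record one chunk fingerprint.
 * Returns 1 if it is new, 0 if it is a duplicate, -1 on allocation failure. */
int table_add(struct table *t, const unsigned char fp[FP_LEN])
{
    size_t idx = fp_prefix(fp) % t->size;  /* assumed bucket mapping */
    struct node *n, *fresh;

    /* Collision handling: walk the chain and compare full fingerprints. */
    for (n = t->buckets[idx]; n != NULL; n = n->next) {
        if (memcmp(n->fp, fp, FP_LEN) == 0) {
            t->total_chunks++;
            return 0;
        }
    }
    fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;
    memcpy(fresh->fp, fp, FP_LEN);
    fresh->next = t->buckets[idx];
    t->buckets[idx] = fresh;
    t->total_chunks++;
    t->unique_chunks++;
    return 1;
}

/* Dedup rate: 1 - (unique_chunks / total_chunks); 0 when nothing was read. */
double dedup_rate(const struct table *t)
{
    if (t->total_chunks == 0)
        return 0.0;
    return 1.0 - (double)t->unique_chunks / (double)t->total_chunks;
}

void table_free(struct table *t)
{
    size_t i;
    for (i = 0; i < t->size; i++) {
        struct node *n = t->buckets[i];
        while (n != NULL) {
            struct node *next = n->next;
            free(n);
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = NULL;
    t->size = 0;
}
```

As a worked example of the formula, 4 chunks of which 3 are unique give a dedup rate of 1 - 3/4 = 0.25. Comparing the full 160-bit fingerprint on every chain walk is what keeps two different chunks with the same 32-bit prefix from being counted as duplicates.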