A Python program that uses Streamlit for its UI and detects duplicate files in a folder. Files are compared by their MD5 hashes.

BertramAakjaer/Python_duplicate_file_checker

Python duplicate checker using streamlit

Note
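If you're using an OS other than Windows, this will probably not work as-is, because the script relies on the Python library os in a way that was only tested on Windows. The os module itself is cross-platform, but some of its functions and path conventions differ between operating systems; see this doc for more info.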

Key Features

  • Uses two different folders (one containing the duplicates and another for dumping the registered duplicates into)

  • Duplicates are moved rather than deleted, so the wrong files are never removed by mistake

  • Loading bar showing how many files have been searched out of the total to search

    • With text showing an estimate of the time the task will take
  • End text confirming that the check has completed and how many files have been transferred

  • Uses the MD5 message-digest algorithm to create hashes; with 128-bit hashes, an accidental collision only becomes likely at around $2^{64}$ files, and hashing stays fast

  • Streamlit library for the UI
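
The first features above can be sketched roughly as follows. This is a minimal illustration and not the code in main.py: the function names `file_md5` and `move_duplicates`, the chunk size, and the rule for renaming files that collide in the destination folder are all assumptions made for this example.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_duplicates(source: Path, destination: Path) -> int:
    """Move files whose content was already seen into `destination`.

    Returns the number of files moved. Nothing is deleted.
    """
    destination.mkdir(parents=True, exist_ok=True)
    seen = set()
    moved = 0
    # Sorted so the copy that stays in place is chosen deterministically.
    for path in sorted(p for p in source.iterdir() if p.is_file()):
        checksum = file_md5(path)
        if checksum not in seen:
            seen.add(checksum)
            continue
        target = destination / path.name
        # Hypothetical naming rule: never overwrite a file already moved there.
        counter = 1
        while target.exists():
            target = destination / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.move(str(path), str(target))
        moved += 1
    return moved
```

Only the top level of the source folder is scanned in this sketch; whether the real script also walks subfolders is not stated here.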

Installation and setup

To clone and run this application, you'll need Git and Python 3.12.2. Many other Python versions should work as well, but 3.12.2 was used to write the script.

From your command line:

# Clone this repository
git clone https://github.com/BertramAakjaer/Python_duplicate_file_checker.git

# Install streamlit library
pip install streamlit

# Enter the directory
cd Python_duplicate_file_checker/

# Run the script using streamlit and it should open in your default browser
python -m streamlit run main.py

Usage

1. First, open your preferred file explorer and find the path to the folder that contains the duplicated files. Paste it into the first textbox, named "Folder with duplicates".

2. Then pick a temporary folder to hold all the registered duplicates and paste its path into the textbox named "Destination folder".

3. Now you're ready: press the button "Start duplicate check" and the program should do the rest for you.
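
While the check runs, the loading bar and time estimate could be driven by a calculation like the one below. This is a hedged sketch and not the script's actual logic: it assumes a simple linear estimate (the remaining files take as long on average as the ones already processed), and the names `estimate_remaining_seconds` and `progress_text` are made up for this example.

```python
import time
from typing import Optional


def estimate_remaining_seconds(
    started_at: float, done: int, total: int, now: Optional[float] = None
) -> float:
    """Linear estimate of the seconds left, based on the average time per file so far."""
    if done <= 0:
        return float("inf")  # nothing processed yet, so no basis for an estimate
    if now is None:
        now = time.monotonic()
    elapsed = now - started_at
    return elapsed * (total - done) / done


def progress_text(
    started_at: float, done: int, total: int, now: Optional[float] = None
) -> str:
    """Build the status line shown next to the loading bar."""
    remaining = estimate_remaining_seconds(started_at, done, total, now)
    if remaining == float("inf"):
        return f"{done}/{total} files searched, estimating time..."
    return f"{done}/{total} files searched, about {remaining:.0f} s left"
```

In a Streamlit app, the fraction `done / total` could be passed to `st.progress` and the text shown alongside it; `started_at` would be a `time.monotonic()` reading taken when the button is pressed.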

Logic behind MD5 hashes

An MD5 hash is used instead of an alternative like SHA-1 (Secure Hash Algorithm 1) because of a few distinct properties. The main points are listed below:

  • Speed: MD5 hashes are cheap to compute, so hashing puts less load on the machine than many other hash algorithms. This is the main reason I chose MD5: it is generally one of the fastest algorithms to use while still having a decent hash size.

  • Hash size: MD5 produces 128-bit hashes, so there are $2^{128}$ possible values, and an accidental collision only becomes likely at around $2^{64}$ files. That is more than enough for this use case, where you rarely have anywhere close to that number of files to compare.

  • Security: MD5 hashes aren't secure enough for passwords or other kinds of sensitive data, but when they are only used to compare files, security isn't a worrying factor, so trading it for greater speed is worth the price.
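
To put the hash-size point in numbers, the chance of an accidental collision can be estimated with the standard birthday-bound approximation, $p \approx 1 - e^{-n(n-1)/2^{b+1}}$ for $n$ files and $b$-bit hashes. The sketch below assumes hashes behave like uniformly random values; the function name `collision_probability` is chosen for this example only.

```python
import math


def collision_probability(num_files: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of `num_files` random hashes collide."""
    exponent = -num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A folder with a million files is nowhere near the risky range.
print(collision_probability(10**6))
# Around 2**64 files, a collision becomes a realistic possibility.
print(collision_probability(2**64))
```

This only covers accidental collisions. Files crafted on purpose to share an MD5 hash are a separate matter, which is the security tradeoff described above.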


The points are drawn from an article comparing MD5 and SHA-1 hashes, which can be read here:

License

This project is licensed under the MIT License.

Socials

aakjaer.site  ·  GitHub @BertramAakjær  ·  Twitter @BertramAakjær
