This repository is dedicated to exploring the theory behind imbalanced and missing data in machine learning datasets and providing practical solutions to deal with these issues. Through comprehensive Jupyter Notebooks, we demonstrate techniques and strategies to mitigate the impact of imbalanced and missing data on model performance.
- Theory: A detailed explanation of what constitutes imbalanced and missing data, why they pose a problem for machine learning models, and the theoretical foundation for the methods used to address them.
- Practical Guides: Jupyter Notebooks that illustrate step-by-step processes for handling imbalanced and missing data, including code examples and explanations.
Ensure you have the following installed:
- Python 3.x
- Jupyter Notebook
- Required Python packages: numpy, pandas, scikit-learn, imbalanced-learn, matplotlib, seaborn
- Datasets: All datasets are publicly available, but if you don't want to search for them manually, you can find all of them here
- Table of Contents (2): While not necessary, I recommend installing this Jupyter Notebook extension for faster navigation.
Clone this repository to your local machine:
git clone https://github.com/Naviden/Data-Quality-Issues.git
- Definition and implications
- Techniques for handling imbalanced data:
  - Over-sampling minority class
  - Under-sampling majority class
  - Synthetic data generation (SMOTE)
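The three resampling techniques listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code from the notebooks: the function names `random_oversample`, `random_undersample`, and `smote_like`, as well as the toy data, are hypothetical. In practice, the `imbalanced-learn` package provides `RandomOverSampler`, `RandomUnderSampler`, and `SMOTE` for the same purposes, and its SMOTE implementation is more careful than the simplified interpolation shown here.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X_minority, n_new, k=2, seed=0):
    """Simplified SMOTE idea: interpolate between a minority point and one of
    its k nearest minority neighbours (needs at least two minority points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_minority))
        base = X_minority[i]
        others = [p for j, p in enumerate(X_minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 8 majority rows (label 0) and 3 minority rows (label 1).
X = [[float(i), float(i % 3)] for i in range(8)]
X += [[10.0, 10.0], [11.0, 10.5], [10.5, 11.0]]
y = [0] * 8 + [1] * 3

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
X_min = [x for x, lab in zip(X, y) if lab == 1]
X_syn = smote_like(X_min, n_new=5)

print(Counter(y_over))   # class counts after over-sampling
print(Counter(y_under))  # class counts after under-sampling
print(len(X_syn))        # number of synthetic minority points
```

Over-sampling and SMOTE keep every original row, while under-sampling discards majority rows. Whichever method is used, resampling should normally be applied to the training split only, so that the evaluation data keeps its original class distribution.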
- Types of missing data: MCAR, MAR, MNAR
- Impact on analysis
- Strategies for dealing with missing data:
  - Imputation methods
  - Dropping missing values
  - Using algorithms that support missing values
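As a rough illustration of the first two strategies, the sketch below contrasts dropping incomplete rows with mean imputation. It uses only the standard library and represents a missing value as `None`; the helper names `drop_incomplete_rows` and `impute_column_means` and the sample rows are hypothetical and not taken from the notebooks. With the packages listed above, the equivalent operations would typically be `DataFrame.dropna` and `DataFrame.fillna` in pandas, or `SimpleImputer` in scikit-learn.

```python
from statistics import mean


def drop_incomplete_rows(rows):
    """Listwise deletion: keep only rows that have no missing (None) entries."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_column_means(rows):
    """Replace each None with the mean of the observed values in its column.

    Raises statistics.StatisticsError if a column has no observed values.
    """
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


# Hypothetical toy table with two numeric columns and two missing entries.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [3.0, 20.0],
]

dropped = drop_incomplete_rows(rows)
imputed = impute_column_means(rows)

print(dropped)
print(imputed)
```

Dropping is simple but loses half of this toy table, while mean imputation keeps every row at the cost of shrinking the column variance. Which trade-off is acceptable depends on the missingness mechanism (MCAR, MAR, or MNAR): both approaches are generally safest under MCAR and can bias results otherwise.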
We welcome contributions! Please feel free to submit pull requests with improvements or new features.
This project is licensed under the MIT License - see the LICENSE file for details.