
Theory and Python code for understanding imbalanced and missing data and how to deal with them.


Naviden/Data-Quality-Issues


Handling Imbalanced and Missing Data

Overview

This repository explores the theory behind imbalanced and missing data in machine learning datasets and provides practical ways to deal with both. Through Jupyter Notebooks, we demonstrate techniques for reducing the impact of imbalanced and missing data on model performance.

Contents

  • Theory: A detailed explanation of what constitutes imbalanced and missing data, why they pose a problem for machine learning models, and the theoretical foundation for the methods used to address them.
  • Practical Guides: Jupyter Notebooks that illustrate step-by-step processes for handling imbalanced and missing data, including code examples and explanations.

Getting Started

Prerequisites

Ensure you have the following installed:

  • Python 3.x
  • Jupyter Notebook
  • Required Python packages: numpy, pandas, scikit-learn, imbalanced-learn, matplotlib, seaborn
  • Datasets: All datasets are publicly available, but if you don't want to search for them manually, you can find all of them here
  • Table of Contents (2): While not required, I recommend installing this Jupyter Notebook extension for faster navigation.
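A quick way to confirm the packages listed above are importable is sketched below. It assumes only the package list from this README; the helper name `missing_packages` is hypothetical. Note that scikit-learn and imbalanced-learn are imported as `sklearn` and `imblearn`, which differ from their pip names.

```python
from importlib.util import find_spec

# pip name -> import name (two of them differ)
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


print("Missing packages:", missing_packages() or "none")
```

Anything reported as missing can then be installed with pip before opening the notebooks.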

Installation

Clone this repository to your local machine:

git clone https://github.com/Naviden/Data-Quality-Issues.git

Content Overview

1. Theory on Imbalanced Data

  • Definition and implications
  • Techniques for handling imbalanced data:
    • Over-sampling minority class
    • Under-sampling majority class
    • Synthetic data generation (SMOTE)
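The three techniques above can be sketched in plain NumPy as follows. This is a simplified illustration, not the code used in the notebooks: it assumes a binary problem, the function names are hypothetical, and `smote_like` is a stripped-down version of the SMOTE idea (interpolating between a minority point and one of its nearest minority neighbours). In practice, imbalanced-learn provides `RandomOverSampler`, `RandomUnderSampler`, and `SMOTE` for this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 90 majority (label 0) and 10 minority (label 1) samples.
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(3.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)


def random_oversample(X, y, minority, rng):
    """Duplicate minority rows (with replacement) until classes match."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]


def random_undersample(X, y, minority, rng):
    """Drop majority rows at random until classes match."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]


def smote_like(X, y, minority, rng, k=5):
    """Add synthetic minority points by interpolating towards a random one
    of each point's k nearest minority neighbours (simplified SMOTE)."""
    X_min = X[y == minority]
    n_new = int(np.sum(y != minority)) - len(X_min)
    # Pairwise distances among minority points; exclude self-matches.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    synthetic = X_min[base] + gap * (X_min[nb] - X_min[base])
    return (np.vstack([X, synthetic]),
            np.concatenate([y, np.full(n_new, minority)]))


X_over, y_over = random_oversample(X, y, minority=1, rng=rng)
X_under, y_under = random_undersample(X, y, minority=1, rng=rng)
X_syn, y_syn = smote_like(X, y, minority=1, rng=rng)
print(np.bincount(y_over), np.bincount(y_under), np.bincount(y_syn))
```

Over-sampling and the SMOTE-style variant both grow the minority class to 90 samples, while under-sampling shrinks the majority class to 10. Resampling should be applied to the training split only, so that the evaluation data keeps the original class distribution.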

2. Theory on Missing Data

  • Types of missing data: MCAR, MAR, MNAR
  • Impact on analysis
  • Strategies for dealing with missing data:
    • Imputation methods
    • Dropping missing values
    • Using algorithms that support missing values
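To make the three missingness mechanisms and the two simplest strategies concrete, here is a minimal NumPy sketch. The data are synthetic, the masking rules are illustrative assumptions chosen to match the MCAR/MAR/MNAR definitions, and the helper names are hypothetical rather than taken from the notebooks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # complete synthetic data

# MCAR: each value in column 0 is hidden with the same probability.
mcar = X.copy()
mcar[rng.random(200) < 0.2, 0] = np.nan

# MAR: column 0 is hidden more often when the observed column 1 is large.
mar = X.copy()
mar[(X[:, 1] > 0) & (rng.random(200) < 0.4), 0] = np.nan

# MNAR: column 0 is hidden depending on its own (unobserved) value.
mnar = X.copy()
mnar[X[:, 0] > 0.5, 0] = np.nan


def drop_incomplete_rows(X):
    """Listwise deletion: keep only rows without any missing value."""
    return X[~np.isnan(X).any(axis=1)]


def mean_impute(X):
    """Replace each missing value with the mean of its column's observed values."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


for name, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name,
          "rows kept after dropping:", len(drop_incomplete_rows(data)),
          "| observed mean of col 0:", round(float(np.nanmean(data[:, 0])), 3),
          "| true mean:", round(float(X[:, 0].mean()), 3))
```

Under the MNAR rule the observed mean of column 0 falls below the true mean, so both dropping rows and mean imputation carry that bias forward. For the third strategy, some estimators accept `NaN` inputs directly; scikit-learn's `HistGradientBoostingClassifier` is one example.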

Contributing

We welcome contributions! Please feel free to submit pull requests with improvements or new features.

License

This project is licensed under the MIT License - see the LICENSE file for details.
