By David Cuervo
Build a classifier to accurately predict the condition of water wells in Tanzania.
- Folder containing the original data sets from DrivenData
- CSV of cleaned data
- Data_Cleaning Notebook: contains the code for exploring and cleaning the original data set
- Modeling Notebook: contains code for building the best classifier
- PNG image of Tanzania and the wells plotted
- PDF of project presentation
- Rubric for Module 3 Project
- Began by downloading the data from DrivenData
- Worked through the data set column by column to deal with missing data, outliers, and categorical variables
- Exported the cleaned data and used it to begin building classifiers
- Used Boruta for feature selection
- Used the features selected through Boruta to build logistic regression, decision tree, and random forest models
- Decision tree was the most accurate model, at 75% accuracy
- Construction year, waterpoint type, and GPS height were the most important features in the model
- Moving forward, prioritize older wells and uncommon types of wells
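
The column-by-column cleaning mentioned above (missing data and categorical variables) could look roughly like the sketch below. It is not the code from the Data_Cleaning notebook. It uses only the Python standard library, the sample rows are made up, the column names `construction_year`, `gps_height`, and `waterpoint_type` are assumed spellings of the features named above, and treating a construction year of 0 as "missing" is an assumption. The helpers `impute_zero_with_median` and `one_hot` are hypothetical names chosen for illustration.

```python
from statistics import median

# Hypothetical sample rows; the values are invented for illustration only.
rows = [
    {"construction_year": 1995, "gps_height": 1390, "waterpoint_type": "communal standpipe"},
    {"construction_year": 0, "gps_height": 686, "waterpoint_type": "hand pump"},
    {"construction_year": 2009, "gps_height": 263, "waterpoint_type": "communal standpipe"},
    {"construction_year": 1986, "gps_height": 0, "waterpoint_type": "other"},
]


def impute_zero_with_median(rows, column):
    """Replace 0 (assumed here to mean 'missing') with the median of the known values."""
    known = [r[column] for r in rows if r[column] != 0]
    fill = median(known)
    return [{**r, column: r[column] if r[column] != 0 else fill} for r in rows]


def one_hot(rows, column):
    """Expand a categorical column into one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        # Copy every column except the categorical one being expanded.
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(r[column] == cat)
        encoded.append(new_row)
    return encoded


# Only construction_year is imputed; gps_height is left untouched in this sketch.
cleaned = impute_zero_with_median(rows, "construction_year")
cleaned = one_hot(cleaned, "waterpoint_type")
```

Median imputation is only one possible choice for missing years; dropping the rows or adding a separate "unknown year" indicator would be equally reasonable, depending on how much data is affected.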