Module 3 Project

By David Cuervo

Background

Build a classifier to accurately predict the condition of water wells in Tanzania.

Contents of Repository

Folder containing the original data sets from DrivenData
CSV of cleaned data
Data_Cleaning Notebook: contains the code for exploring and cleaning the original data set
Modeling Notebook: contains code for building the best classifier
PNG image of Tanzania and the wells plotted
PDF of project presentation
Rubric for Module 3 Project

Approach

Began by downloading the data from DrivenData
Worked through the data set column by column to deal with missing data, outliers, and categorical variables
Exported the cleaned data and used it to begin building classifiers
Used Boruta for feature selection
Used the features selected through Boruta to build logistic regression, decision tree, and random forest models
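
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks. The data is synthetic, and the column names (construction_year, gps_height, waterpoint_type, status), the use of 0 as a missing-year marker, the clipping percentiles, and the model settings are all assumptions made for the example. The Boruta feature selection step is left out so that the sketch depends only on pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; names and values are illustrative assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "construction_year": rng.choice([0, 1985, 1995, 2005, 2010], size=n),
    "gps_height": rng.normal(1000, 400, size=n),
    "waterpoint_type": rng.choice(["hand pump", "standpipe", "other"], size=n),
})
# Toy target so the models have something to learn.
df["status"] = np.where(df["construction_year"] > 1990,
                        "functional", "non functional")


def clean(raw):
    """Handle missing data, outliers, and categorical variables."""
    out = raw.copy()
    # Missing data: treat 0 as an unknown year, fill with the median known year.
    year = out["construction_year"].replace(0, np.nan)
    out["construction_year"] = year.fillna(year.median())
    # Outliers: clip heights to the 1st-99th percentile range.
    low, high = out["gps_height"].quantile([0.01, 0.99])
    out["gps_height"] = out["gps_height"].clip(low, high)
    # Categorical variables: one-hot encode.
    return pd.get_dummies(out, columns=["waterpoint_type"], dtype=float)


X = clean(df.drop(columns="status"))
y = df["status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Accuracy on this toy data says nothing about the real wells; the sketch only shows the shape of the workflow.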

Conclusions

The decision tree was the most accurate model, with 75% accuracy

Construction year, waterpoint type, and GPS height were the most important features in the model
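
One common way to rank features like this is to read the feature_importances_ attribute of a fitted scikit-learn decision tree. The sketch below assumes that approach and runs on synthetic data with hypothetical column names; it is not the code or the results from the Modeling notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; column names and values are illustrative assumptions.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "construction_year": rng.integers(1960, 2014, size=n),
    "gps_height": rng.normal(1000, 400, size=n),
    "noise": rng.normal(size=n),
})
# Toy target that depends only on construction year.
y = (X["construction_year"] > 1995).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Impurity-based importances, labeled by column and sorted from high to low.
importances = pd.Series(
    tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a check with permutation importance is worth considering before acting on the ranking.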

Moving forward, prioritize older wells and uncommon well types when assessing well condition