Diabetes Prediction Project

Diabetes Prediction from CDC Behavioral Risk Factor Surveillance System (BRFSS) Survey data

See the Project Instructions for more details about the project requirements
Please see the class presentation on Diabetes Predictions from CDC 2015 & 2021 BRFSS Survey Data

Team

Jeff Flachman
Ava Lee
Elia Porter

Project Checklist:

Please see the project checklist for artifacts supporting the project objectives.

Executive Summary

This project aims to analyze the factors contributing to the prevalence of diabetes in the United States and determine whether they provide predictive value in determining a diagnosis of diabetes.

A dataset pulled from the 2015 BRFSS was available on the UC Irvine Machine Learning Repository. This dataset had already been cleaned from the 2015 BRFSS survey data. The team also pulled and cleaned data from the 2021 BRFSS survey.

Project Overview

Diabetes is the eighth leading cause of death in the United States, but many rank it second behind heart disease as a chronic illness that leads to death. Diabetes also has a daily impact on those who live with it.

The team was interested in diabetes predictions using data from the CDC Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is an annual phone survey of 300K-400K respondents.

A dataset pulled from the 2015 BRFSS was available on the UC Irvine Machine Learning Repository. Because this dataset was older and had already been cleaned, the team also pulled and cleaned the 2021 BRFSS survey data.

The diabetes risk factors listed below were used to select survey results from the BRFSS dataset. The 2015 dataset contained 21 features. The team selected 35 features from the 2021 dataset.

Two targets were evaluated:

target 1 (0,1,2): 0: no diabetes, 1: pre-diabetes, 2: diabetes
target 2 (binary:0,1): 0: no diabetes, 1: diabetes

Classification models were trained and the metrics were computed. In addition, alternate scaling and sampling techniques were used to handle the imbalance in the datasets.

In all, 63 configurations of binary/012, scaling & sampling method, and models were trained and the metrics were computed for each base dataset (2015 & 2021).

The metrics were evaluated and a few targeted dataset configurations were selected to optimize. These included:

- binary target, StandardScaler, RandomUnderSampler data with models:
  - LogisticRegression optimized with RandomizedSearchCV
  - AdaBoost optimized with RandomizedSearchCV

For more information, see the Details below.

Project Details:

Ideation

Potential Datasets Evaluated

The team brainstormed multiple dataset options for this project. Some of the datasets reviewed are listed in the datasets files. The team reviewed candidate datasets for abalone, mushroom, bike sharing, and diabetes.

The team was most interested in diabetes predictions using data from the CDC Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is an annual phone survey of 300K-400K respondents.

As a starting point, a dataset pulled from the 2015 BRFSS was used from the UC Irvine Machine Learning Repository.

Project directory structure

README.md - This file provides a description of the project

Directories:

Note: The project classification analysis and results of the CDC Behavioral Risk Factor Surveillance System (BRFSS) Survey data are contained in the brfss_2015 and brfss_2021 directories.

directory	description
`brfss_2015`	Contains all the analysis of the 2015 BRFSS data. A description of the files in this directory can be found in the brfss_2015 README
`brfss_2021`	Contains all the analysis of the 2021 BRFSS data. A description of the files in this directory can be found in the brfss_2021 README
`data_cleaning`	Contains all the work to read the codebooks, refine features, and transform and modify feature values. This work was then moved into the brfss_2015 and brfss_2021 directories as single 1_....ipynb files that pull and clean the data for each year
`pkgs`	Python files containing pipeline code written for data
`docs`	other markdown files referenced by this README and other docs
`imgs`	graphics included in documentation markdowns
`prototype_ava`	Prototype code written by Ava Lee
`prototype_elia`	Prototype code written by Elia Porter
`prototype_jeff`	Prototype code written by Jeff Flachman

Note: Additional READMEs are available in some subdirectories.

Feature Engineering

Understanding Diabetes

In order to understand the features, it is important to understand the risks and indicators of diabetes.

Risk Factors

There are many risk factors for developing type 2 diabetes, including:

Age: Being over 40 increases your risk.
Family history: Having a parent, sibling, or other relative with type 1 or type 2 diabetes increases your risk.
Ethnicity: People of certain races and ethnicities, including African Americans, Hispanics, American Indians, and Asian-Americans, are more likely to develop type 2 diabetes.
Inactivity: The less active you are, the greater your risk.
Weight: Being overweight or obese increases your risk. You can estimate your risk by measuring your waist circumference. Men have a higher risk if their waist circumference is more than 40 inches, while women who are not pregnant have a higher risk if their waist circumference is more than 35 inches.
Blood pressure: High blood pressure can lead to insulin resistance and eventually type 2 diabetes.
Cholesterol: High cholesterol can raise your risk for diabetes and heart disease.
Smoking: Smokers are 30-40% more likely than non-smokers to develop type 2 diabetes.

Diabetes Indicators / Symptoms

Diabetes is a chronic condition that can be diagnosed by a medical professional. While it often has no symptoms, some indicators include:

Urination: Frequent urination, especially at night
Thirst: Excessive thirst
Hunger: Increased hunger, even when eating
Weight loss: Unintentional weight loss
Fatigue: Feeling more tired than usual
Vision: Blurred vision
Wounds: Cuts and bruises that take longer to heal
Skin: Itchy skin or genital itching
Infections: Urinary tract infections (UTIs) or yeast infections
Sensations: Unusual sensations like tingling, burning, or prickling

Feature Selection

A list of features was pulled from the UCI/Kaggle documentation on the 2015 dataset. In addition, the 2021 codebook was imported and parsed. See the work in data_cleaning. The features in the codebook were evaluated and selected, and a summary of the selected 2021 features was written to a file.

Key features relevant to diabetes analysis were selected. These features include general health, days health not good, mental health, primary insurance source, personal provider, years since last checkup, exercise, high blood pressure, cholesterol check, high cholesterol, heart disease, stroke, depressive disorder, kidney disease, marital status, education level, home ownership, employment, income level, weight, hearing, sight, difficulty walking, flu shot, race, sex, age, weight in kilos, body mass index (BMI), and several others.

Data Cleaning

A contributing factor to including 2021 data was that the features in the 2015 data on UCI/Kaggle were already selected and cleaned. Therefore, the team put considerable effort into automating the inclusion and cleaning of other CDC BRFSS survey years. Of the years after 2015, 2021 had the most features related to the risk factors for diabetes, so 2021 was selected as the best year to clean. A list of years and feature counts pulled from the CDC website is recorded in the CDC - BRFSS Datasets by year file.

The CDC also has a list of diabetes indicators for machine learning.

The files ml_clean_config.py and ml_clean_features.py contain the functions written to handle processing the codebooks, selecting features and cleaning the data.

Cleaning
The CDC BRFSS survey responses were already provided as numeric values. Therefore, get_dummies, OneHotEncoder and OrdinalEncoder were not required. However, some cleaning was still necessary. Some responses were unknown or refused, and those rows needed to be dropped. Other values needed to be scaled (e.g., a recorded weight of 4015 needed to be scaled to 40.15 kg). Finally, the numeric values for some responses needed to be transformed. For example, for exercise, the value 88 (no days) was transformed to 0 days, where 1-30 was the number of days of exercise in the month.
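
The three cleaning steps described above can be sketched in pandas. This is a minimal illustration, not the project's code: the column names (`gen_health`, `weight_kg_x100`, `exercise_days`) and the unknown/refused codes (7 and 9) are assumptions made for the example, and the real codes come from the BRFSS codebook and differ by question.

```python
import pandas as pd

# Toy survey responses; column names and codes are illustrative assumptions.
raw = pd.DataFrame({
    "gen_health": [1, 3, 7, 2, 9],                      # 7 = unknown, 9 = refused (assumed)
    "weight_kg_x100": [4015, 8165, 7000, 9072, 6500],   # two implied decimal places
    "exercise_days": [88, 12, 30, 88, 5],               # 88 = no days
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows, rescale weight, and recode the 88 sentinel."""
    # 1. Drop rows with unknown or refused answers.
    out = df.loc[~df["gen_health"].isin([7, 9])].copy()
    # 2. Rescale values stored with implied decimal places (4015 -> 40.15 kg).
    out["weight_kg"] = out["weight_kg_x100"] / 100
    out = out.drop(columns="weight_kg_x100")
    # 3. Recode the sentinel 88 ("no days") to 0 days.
    out["exercise_days"] = out["exercise_days"].replace(88, 0)
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```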

Substantial time was spent productionizing (pipeline) the processing of the CDC Codebooks, simplifying feature extraction and feature cleaning and imputation. The files supporting processing the codebooks and cleaning the data can be found in the data_cleaning directory. Ultimately, the feature descriptions for the 2021 dataset were automatically generated into the following file: 2021 features. A dictionary based configuration file was used to define the operations to be made on each feature in the dataset and a function then performed imputation on all features in a single function call.
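
One way to picture a dictionary-based configuration applied in a single call is sketched below. The actual structure used in ml_clean_config.py is not shown in this README, so the keys (`drop`, `recode`, `scale`), the column names, and the `apply_config` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical per-feature configuration; keys and codes are assumptions.
CLEAN_CONFIG = {
    "gen_health": {"drop": [7, 9]},
    "weight": {"drop": [7777, 9999], "scale": 0.01},
    "exercise_days": {"drop": [77, 99], "recode": {88: 0}},
}


def apply_config(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Apply the drop, recode, and scale rules to every configured column."""
    out = df.copy()
    for column, rules in config.items():
        if "drop" in rules:
            # Remove rows whose answer is one of the listed codes.
            out = out.loc[~out[column].isin(rules["drop"])].copy()
        if "recode" in rules:
            # Map sentinel codes to their real meaning.
            out[column] = out[column].replace(rules["recode"])
        if "scale" in rules:
            # Undo implied decimal places.
            out[column] = out[column] * rules["scale"]
    return out.reset_index(drop=True)


survey = pd.DataFrame({
    "gen_health": [1, 9, 2, 3],
    "weight": [4015, 8000, 9999, 7250],
    "exercise_days": [88, 10, 5, 77],
})
result = apply_config(survey, CLEAN_CONFIG)
print(result)
```

Keeping the rules in data rather than in code means a new survey year can be handled by editing the dictionary instead of the cleaning function.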

Finally, the 2021 data originally had 55 features. A correlation matrix was plotted and the most highly correlated feature pairs were reviewed for potential duplication of information. For example, the dataset started with 5 race features, which were reduced to 1. Several education features were likewise reduced to 1. The feature reduction as well as the other feature engineering steps can be found in the 2021 Data Cleaning Notebook. The final 2021 feature set has 36 features and the target (diabetes).
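
A correlation review of this kind can be approximated by listing the feature pairs with the highest absolute correlation. The sketch below uses synthetic data; the column names, the `top_correlated_pairs` helper, and the 0.8 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation) is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "race_a": base,
    "race_b": base + rng.normal(scale=0.05, size=500),  # near-duplicate of race_a
    "bmi": rng.normal(size=500),                        # unrelated feature
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

Pairs that surface here are candidates for review only; whether one of them is dropped remains a judgment about what each feature measures.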

This cleaning process produced the base data used in the Data Analysis process below. The base data target feature diabetes consisted of three values:

(0,1,2): 0: no diabetes, 1: pre-diabetes, 2: diabetes

Data Analysis

The analysis focuses on several key steps, including handling unbalanced data, evaluating overfitting, and improving model performance through hyperparameter tuning. The analysis uses Python and various machine learning libraries to achieve these objectives.

Several pipelines were used to streamline the analysis.

data preparation pipeline: applies additional feature transformation, scaling and sampling methods to the base data to address issues found with unbalanced data and overfitting

This pipeline was used to run models and generate metrics for the following modified datasets:

#	feature	scaling	sampling	Dataset
1	diabetes 0/1/2	none	none	Base dataset
2.0	diabetes 0/1/2	StandardScaler	none	standard_scaled
2.1	diabetes 0/1/2	MinMaxScaler	none	minmax_scaled
3	diabetes 0/1	StandardScaler	none	ss_binary
4	diabetes 0/1	StandardScaler	RandomUnderSampler	sb_random_undersample
5	diabetes 0/1	StandardScaler	RandomOverSampler	sb_random_oversample
6	diabetes 0/1	StandardScaler	ClusterCentroids	sb_cluster
7	diabetes 0/1	StandardScaler	SMOTE	sb_smote
8	diabetes 0/1	StandardScaler	SMOTEENN	sb_smoteenn
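
As a rough sketch of what a row such as `sb_random_undersample` involves, the example below scales synthetic features with scikit-learn's `StandardScaler` and then balances the classes. The undersampling step is a hand-written NumPy stand-in for imbalanced-learn's `RandomUnderSampler`, used here only to keep the example free of extra dependencies; the `random_undersample` helper and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def random_undersample(X, y, seed=0):
    """Randomly keep as many rows of each class as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Synthetic data with roughly the 86% / 14% split of the binary target.
rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(1000, 3))
y = (rng.random(1000) < 0.14).astype(int)

X_scaled = StandardScaler().fit_transform(X)      # zero mean, unit variance
X_bal, y_bal = random_undersample(X_scaled, y)    # equal class counts

print(np.bincount(y), "->", np.bincount(y_bal))
```

In a train/test workflow, the scaler would normally be fitted and the resampling applied on the training split only, so the test split keeps the original class balance.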

model execution pipeline: ran a series of 7 models, collected metrics, displayed the metrics in the Jupyter notebook and pushed them to a file.
- models included:
  - KNeighborsClassifier(n_neighbors=k_value)
  - tree.DecisionTreeClassifier()
  - RandomForestClassifier()
  - ExtraTreesClassifier(random_state=1)
  - GradientBoostingClassifier(random_state=1)
  - AdaBoostClassifier(random_state=1)
  - LogisticRegression()
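
A loop of this shape could look like the sketch below. It is not the project's pipeline: the `run_models` helper, the synthetic data from `make_classification`, and the reduced three-model list are assumptions chosen to keep the example short and quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and collect one row of metrics per model."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append({
            "model": name,
            "train_score": model.score(X_train, y_train),
            "test_score": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        })
    return rows


# Synthetic, imbalanced binary data standing in for the survey features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.86, 0.14], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
    "GradientBoostingClassifier": GradientBoostingClassifier(
        n_estimators=20, random_state=1
    ),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
metrics = run_models(models, X_train, X_test, y_train, y_test)
for row in metrics:
    print(row)
```

Recording the training score next to the test score in each row makes it possible to check for overfitting later without refitting anything.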

Evaluating Overfitting

All the models were greatly overfit with the base dataset.

All models were then run against each modified dataset. The metrics were prepared and archived in the brfss_2021/reports/ directory.

It was determined that overfitting occurred in most cases. However, it was minimized by using the binary target feature, scaling with StandardScaler or MinMaxScaler, and resampling with RandomOverSampler or RandomUnderSampler.
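
A common way to quantify overfitting is the gap between the training score and the test score of each run. The helper below is a sketch under that assumption; the `flag_overfit` name, the 0.05 threshold, and the scores are invented for illustration and are not project results.

```python
def flag_overfit(rows, max_gap=0.05):
    """Return (dataset, model, gap) for runs whose train/test gap exceeds max_gap."""
    flagged = []
    for row in rows:
        gap = row["train_score"] - row["test_score"]
        if gap > max_gap:
            flagged.append((row["dataset"], row["model"], round(gap, 3)))
    # Largest gap first, so the worst offenders are at the top.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Invented example scores, only to show the shape of the input.
runs = [
    {"dataset": "run_a", "model": "model_1", "train_score": 1.00, "test_score": 0.80},
    {"dataset": "run_b", "model": "model_2", "train_score": 0.75, "test_score": 0.74},
    {"dataset": "run_a", "model": "model_3", "train_score": 0.99, "test_score": 0.85},
]
overfit = flag_overfit(runs)
print(overfit)
```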

Imbalanced data

Value count % of base data:

target value	%	description
0	84%	No diabetes
1	2%	Pre-diabetes
2	14%	Diabetes

Value count % of binary data:

target value	%	description
0	86%	No diabetes
1	14%	Diabetes
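
The two distributions are related by collapsing the three-class target into two classes. The sketch below assumes that pre-diabetes (1) is folded into "no diabetes" (0), which is consistent with the percentages above (84% + 2% = 86%) but is not stated explicitly in this README. The series is synthetic.

```python
import pandas as pd

# Synthetic target with the 84% / 2% / 14% split reported above.
target_012 = pd.Series([0] * 84 + [1] * 2 + [2] * 14, name="diabetes")

# Assumption: pre-diabetes (1) is treated as "no diabetes" (0) in the binary target.
target_binary = target_012.map({0: 0, 1: 0, 2: 1})


def percentages(series: pd.Series) -> dict:
    """Return the share of each class as a whole-number percentage."""
    shares = series.value_counts(normalize=True).sort_index() * 100
    return shares.round().astype(int).to_dict()


pct_012 = percentages(target_012)
pct_binary = percentages(target_binary)
print(pct_012, pct_binary)
```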

Using the following sampling methods improved the metric results:

RandomOverSampler
RandomUnderSampler
ClusterCentroids
SMOTE
SMOTEENN

RandomOverSampler & RandomUnderSampler performed as well as the others and had a better execution time. RandomUnderSampler provided the smallest dataset to train and fit. Therefore it was used in the optimization phase.

Metric Evaluation

The metrics for all the modified datasets and models are provided in the reports directory. The performance summary shows the performance of all model executions. The detailed reports are listed below:

The details of the 2015 dataset runs are contained in these files:

Initial Results of the 7 models * 9 Permutations of the data (datasets) for 63 total runs.

We sorted the top 20 results by accuracy, the top 20 by precision, and the top 20 by F1 score. We then performed an inner join on the results. The models that performed best across the three lists are:

Addressing imbalanced data with RandomOverSampler or SMOTE worked best with the RandomForestClassifier, ExtraTreesClassifier, and GradientBoostingClassifier models.

However, the top models based on accuracy are GradientBoostingClassifier, AdaBoostClassifier, and LogisticRegression with the binary data (0/1: no diabetes/diabetes) and StandardScaler applied to the dataset.
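
The top-N-per-metric inner join used for this ranking can be sketched with pandas. The `top_on_all_metrics` helper, the column names, and the scores are assumptions made for the example, and `n` is set to 2 instead of 20 so the toy table stays small.

```python
import pandas as pd


def top_on_all_metrics(results, metrics=("accuracy", "precision", "f1"), n=20):
    """Keep only runs that rank in the top n for every listed metric."""
    keys = ["dataset", "model"]
    common = None
    for metric in metrics:
        top = results.nlargest(n, metric)[keys]
        # An inner join keeps only the runs present in every top-n list so far.
        common = top if common is None else common.merge(top, on=keys, how="inner")
    # Attach the metric columns back to the surviving runs.
    return common.merge(results, on=keys, how="inner")


# Invented scores, only to show the mechanics.
results = pd.DataFrame({
    "dataset": ["d1", "d1", "d2", "d2"],
    "model": ["A", "B", "C", "D"],
    "accuracy": [0.90, 0.85, 0.70, 0.60],
    "precision": [0.50, 0.70, 0.80, 0.30],
    "f1": [0.60, 0.65, 0.50, 0.70],
})
best = top_on_all_metrics(results, n=2)
print(best)
```

Here only model B appears in the top two of all three metrics, so it is the only row that survives the joins.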

Optimization / Hyperparameter tuning

Hyperparameter Tuning

Decision Tree Classifier + Randomized Search CV:
- We sampled a fixed number of parameter settings from specified ranges for efficiency
- The optimization helped but not a substantial amount on this dataset
- We sorted the highest F1 score, precision, and accuracy
- The results were these 4 data sets:
- The final parameters and scores reflect the optimized model's ability to predict diabetes with higher accuracy and reliability.
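
A randomized search over a decision tree can be set up roughly as follows. This is a sketch on synthetic data: the parameter ranges, `n_iter=10`, `cv=3`, and the F1 scoring choice are assumptions, not the settings used in the project notebooks.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced data standing in for an undersampled binary dataset.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.5, 0.5], random_state=1
)

# Ranges to sample from; only n_iter combinations are tried, not the full grid.
param_distributions = {
    "max_depth": randint(2, 12),          # integers 2..11
    "min_samples_leaf": randint(1, 20),   # integers 1..19
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="f1",
    cv=3,
    random_state=1,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

Sampling a fixed number of settings keeps the cost at `n_iter` times the number of folds, regardless of how wide the ranges are.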

Conclusions - Project Goal Achievement?

Conclusion:

Conclusions from 63 model/dataset runs for each year (126 total dataset/model combinations):
We achieved good accuracy but, because of the imbalance, struggled with precision. Optimization helped some, but did not make large gains for most models.

Top Models based on accuracy

GradientBoostingClassifier
AdaBoostClassifier
LogisticRegression

Top Datasets

Binary dataset with StandardScaler
Binary, StandardScaler & SMOTEENN sampling.

Project Goal: Achieved

Successfully identified key factors contributing to diabetes prevalence.
Developed predictive models with significant accuracy and reliability.
Strong predictive performance through application of pipelines, optimized datasets, advanced classification models, model performance ranking, and model optimization.

Files and Directories

data cleaning

Initial data exploration and cleaning work. These two directories are related to the initial evaluation of all the BRFSS datasets for 2019 to 2022, reading of the codebooks, and evaluating and selecting features. The results were applied to the data_cleaning file listed below under files: Data Cleaning

data: Data pulled for the initial feature analysis and data cleaning research
data_cleaning: Notebooks and files for the initial feature analysis and data cleaning research

Analysis

Two full analyses were run: one for the BRFSS 2015 dataset and the other for the BRFSS 2021 dataset. These analyses are self-contained in the following directories:

brfss_2015: 2015 analysis and optimizations
brfss_2021: 2021 analysis and optimizations

Documentation

docs: Other descriptive markdown documents
imgs: Images to support this README
