Diabetes Prediction Project

Diabetes Prediction from CDC Behavioral Risk Factor Surveillance System (BRFSS) Survey data

Team

  • Jeff Flachman
  • Ava Lee
  • Elia Porter

Project Checklist:


Please see the project checklist for artifacts supporting the project objectives.


Executive Summary

This project aims to analyze the factors contributing to the prevalence of diabetes in the United States and to determine whether those factors provide predictive value for diagnosing diabetes.

A dataset pulled from the 2015 BRFSS was available on the UC Irvine Machine Learning Repository; it had already been cleaned from the raw 2015 survey data. The team also pulled and cleaned data from the 2021 BRFSS survey.

Project Overview

Diabetes is the eighth leading cause of death in the United States, and many rank it second only to heart disease among chronic illnesses that lead to death. Diabetes also has a daily impact on those who live with it.

The team was interested in diabetes prediction using data from the CDC Behavioral Risk Factor Surveillance System (BRFSS), an annual phone survey of 300K-400K respondents.

A dataset pulled from the 2015 BRFSS was available on the UC Irvine Machine Learning Repository and had already been cleaned from the raw survey data. Because that dataset was older and pre-cleaned, the team also pulled and cleaned the 2021 BRFSS survey.

The diabetes risk factors listed below were used to select survey questions from the BRFSS datasets. The 2015 dataset contained 21 features; the team selected 35 features from the 2021 dataset.

Two targets were evaluated:

  • target 1 (0,1,2): 0: no diabetes, 1: pre-diabetes, 2: diabetes
  • target 2 (binary:0,1): 0: no diabetes, 1: diabetes

Classification models were trained and the metrics were computed. In addition, alternate scaling and sampling techniques were used to handle the imbalance in the datasets.

In all, 63 configurations of target (binary or 0/1/2), scaling and sampling method, and model were trained, and the metrics were computed, for each base dataset (2015 & 2021).

The metrics were evaluated and a few targeted dataset configurations were selected for optimization. These used the binary target, StandardScaler, and RandomUnderSampler data with the following models:

  • LogisticRegression optimized with RandomizedSearchCV
  • AdaBoost optimized with RandomizedSearchCV

For more information, see the Details below.

Project Details:

Ideation

Potential Datasets Evaluated

The team brainstormed multiple dataset options for this project. Some of the datasets reviewed are listed in the datasets files. The team reviewed candidate datasets for abalone, mushroom, bike sharing, and diabetes.

The team was most interested in diabetes prediction using data from the CDC Behavioral Risk Factor Surveillance System (BRFSS), an annual phone survey of 300K-400K respondents.

As a starting point, a dataset pulled from the 2015 BRFSS was obtained from the UC Irvine Machine Learning Repository.

Project directory structure

README.md - This file provides a description of the project

Directories:

Note: The project classification analysis and results of the CDC Behavioral Risk Factor Surveillance System (BRFSS) Survey data are contained in the brfss_2015 and brfss_2021 directories.

directory        description
brfss_2015       Contains all the analysis of the 2015 BRFSS data. A description of the files in this directory can be found in the brfss_2015 README.
brfss_2021       Contains all the analysis of the 2021 BRFSS data. A description of the files in this directory can be found in the brfss_2021 README.
data_cleaning    Contains all the work to read codebooks, refine features, and transform and modify feature values. This work was then moved into the brfss_2015 and brfss_2021 directories as single 1_....ipynb files that pulled and cleaned the data for each year.
pkgs             Python files containing pipeline code written for data processing.
docs             Other markdown files referenced by this README and other docs.
imgs             Graphics included in documentation markdowns.
prototype_ava    Prototype code written by Ava Lee.
prototype_elia   Prototype code written by Elia Porter.
prototype_jeff   Prototype code written by Jeff Flachman.

Note: Additional READMEs are available in some subdirectories.

Feature Engineering

Understanding Diabetes


In order to understand the features, it is important to understand the risk factors and indicators of diabetes.

Risk Factors

There are many risk factors for developing type 2 diabetes, including:

  • Age: Being over 40 increases your risk.
  • Family history: Having a parent, sibling, or other relative with type 1 or type 2 diabetes increases your risk.
  • Ethnicity: People of certain races and ethnicities, including African Americans, Hispanics, American Indians, and Asian-Americans, are more likely to develop type 2 diabetes.
  • Inactivity: The less active you are, the greater your risk.
  • Weight: Being overweight or obese increases your risk. You can estimate your risk by measuring your waist circumference. Men have a higher risk if their waist circumference is more than 40 inches, while women who are not pregnant have a higher risk if their waist circumference is more than 35 inches.
  • Blood pressure: High blood pressure can lead to insulin resistance and eventually type 2 diabetes.
  • Cholesterol: High cholesterol can raise your risk for diabetes and heart disease.
  • Smoking: Smokers are 30-40% more likely than non-smokers to develop type 2 diabetes.

Diabetes Indicators / Symptoms

Diabetes is a chronic condition that can be diagnosed by a medical professional. While it often has no symptoms, some indicators include:

  • Urination: Frequent urination, especially at night
  • Thirst: Excessive thirst
  • Hunger: Increased hunger, even when eating
  • Weight loss: Unintentional weight loss
  • Fatigue: Feeling more tired than usual
  • Vision: Blurred vision
  • Wounds: Cuts and bruises that take longer to heal
  • Skin: Itchy skin or genital itching
  • Infections: Urinary tract infections (UTIs) or yeast infections
  • Sensations: Unusual sensations like tingling, burning, or prickling

Feature Selection

A list of features was pulled from the UCI/Kaggle documentation on the 2015 dataset. In addition, the 2021 codebook was imported and parsed; see the work in data_cleaning. The features in the codebook were evaluated and selected, and a summary of the selected 2021 features was written to a file.

Key features relevant to diabetes analysis were selected. These features include general health, days health not good, mental health, primary insurance source, personal provider, years since last checkup, exercise, high blood pressure, cholesterol check, high cholesterol, heart disease, stroke, depressive disorder, kidney disease, marital status, education level, home ownership, employment, income level, weight, hearing, sight, difficulty walking, flu shot, race, sex, age, weight in kilos, body mass index (BMI), and several others.

Data Cleaning

A contributing factor to including 2021 data was that the features in the 2015 data on UCI/Kaggle had already been selected and cleaned. The team therefore put considerable effort into automating the inclusion and cleaning of other CDC BRFSS survey years. Of the years after 2015, 2021 had the most features related to diabetes risk factors, so it was selected as the best year to clean. A list of years and feature counts pulled from the CDC website is recorded in the CDC - BRFSS Datasets by year file.

The CDC also publishes a list of diabetes indicators for machine learning.

The files ml_clean_config.py and ml_clean_features.py contain the functions written to handle processing the codebooks, selecting features and cleaning the data.

Cleaning
The CDC BRFSS survey responses were already provided as numeric values, so get_dummies, OneHotEncoder, and OrdinalEncoder were not required. However, some cleaning was still necessary. Some responses were unknown or refused, and those rows needed to be dropped. Other values needed to be rescaled (e.g. a weight encoded as 4015 needed to be scaled to 40.15 kg). Finally, the numeric values for some responses needed to be transformed: for exercise, the value 88 (no days) was transformed to 0 days, while 1-30 indicated the number of days of exercise per month.
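A minimal sketch of these cleaning steps in pandas; the file name, column names, and response codes below are illustrative placeholders, not the actual BRFSS field names:

```python
import pandas as pd

# Load the raw survey extract (hypothetical file name).
df = pd.read_csv("brfss_2021_raw.csv")

# Drop rows where the respondent answered "don't know" or "refused"
# (illustrative codes 7 and 9 for a yes/no question).
df = df[~df["high_blood_pressure"].isin([7, 9])].copy()

# Rescale implied-decimal values, e.g. a weight encoded as 4015 -> 40.15 kg.
df["weight_kg"] = df["weight_kg"] / 100

# Transform special codes: 88 means "no days", so map it to 0;
# values 1-30 already give the number of days of exercise per month.
df["exercise_days"] = df["exercise_days"].replace({88: 0})
```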

Substantial time was spent productionizing (as a pipeline) the processing of the CDC codebooks, simplifying feature extraction, feature cleaning, and imputation. The files supporting processing of the codebooks and cleaning of the data can be found in the data_cleaning directory. Ultimately, the feature descriptions for the 2021 dataset were automatically generated into the following file: 2021 features. A dictionary-based configuration file defines the operations to be applied to each feature in the dataset, and a function then performs imputation on all features in a single call.
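The sketch below illustrates the dictionary-driven idea; it is not the actual API of ml_clean_config.py or ml_clean_features.py, and the feature names and codes are placeholders:

```python
# Hypothetical configuration: each feature maps to the operations to apply.
clean_config = {
    "exercise_days":       {"drop_values": [77, 99], "remap": {88: 0}},
    "weight_kg":           {"drop_values": [7777, 9999], "scale": 0.01},
    "high_blood_pressure": {"drop_values": [7, 9]},
}

def clean_features(df, config):
    """Apply drop / remap / scale operations to every configured feature."""
    df = df.copy()
    for col, ops in config.items():
        if "drop_values" in ops:
            df = df[~df[col].isin(ops["drop_values"])]
        if "remap" in ops:
            df[col] = df[col].replace(ops["remap"])
        if "scale" in ops:
            df[col] = df[col] * ops["scale"]
    return df

df = clean_features(df, clean_config)
```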

Finally, the 2021 data originally had 55 features. A correlation matrix was plotted, and the most highly correlated feature pairs were reviewed for potential duplication of information. For example, the dataset started with 5 race features, which were reduced to 1; several education features were likewise reduced to 1. The feature reduction, as well as the other feature engineering steps, can be found in the 2021 Data Cleaning Notebook. The final 2021 feature set has 36 features plus the target (diabetes).
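A sketch of how such a correlation review might look, assuming the cleaned DataFrame df from the steps above:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over all numeric features.
corr = df.corr(numeric_only=True)

# Heatmap for a visual scan of strongly correlated features.
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

# List the most highly correlated feature pairs (upper triangle only)
# to spot near-duplicate features such as multiple race or education columns.
upper = corr.mask(np.tril(np.ones(corr.shape, dtype=bool)))
print(upper.stack().sort_values(key=abs, ascending=False).head(10))
```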

This cleaning process produced the base data used in the Data Analysis process below. The base data target feature, diabetes, consisted of three values:

  • (0,1,2): 0: no diabetes, 1: pre-diabetes, 2: diabetes
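The binary target used later can be derived from this three-valued feature. Based on the value counts reported under Imbalanced data below, pre-diabetes appears to be grouped with no diabetes; the column name here is assumed:

```python
# Collapse the three-valued target into a binary one:
# 0/1 (no diabetes / pre-diabetes) -> 0, and 2 (diabetes) -> 1.
df["diabetes_binary"] = (df["diabetes"] == 2).astype(int)

print(df["diabetes_binary"].value_counts(normalize=True))
```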

Data Analysis

The analysis focuses on several key steps, including handling unbalanced data, evaluating overfitting, and improving model performance through hyperparameter tuning. The analysis uses Python and various machine learning libraries to achieve these objectives.

Several pipelines were used to streamline the analysis.

  • data preparation pipeline: applies additional feature transformations, scaling, and sampling methods to the base data to address issues found with unbalanced data and overfitting

    • This pipeline was used to generate the following modified datasets, against which models were run and metrics generated:
      #    feature          scaling         sampling            Dataset
      1    diabetes 0/1/2   none            none                Base dataset
      2.0  diabetes 0/1/2   StandardScaler  none                standard_scaled
      2.1  diabetes 0/1/2   MinMaxScaler    none                minmax_scaled
      3    diabetes 0/1     StandardScaler  none                ss_binary
      4    diabetes 0/1     StandardScaler  RandomUnderSampler  sb_random_undersample
      5    diabetes 0/1     StandardScaler  RandomOverSampler   sb_random_oversample
      6    diabetes 0/1     StandardScaler  ClusterCentroids    sb_cluster
      7    diabetes 0/1     StandardScaler  SMOTE               sb_smote
      8    diabetes 0/1     StandardScaler  SMOTEENN            sb_smoteenn
  • model execution pipeline: ran a series of 7 models, collected metrics, displayed the metrics in the Jupyter notebook, and pushed them to a file (a condensed sketch of both pipelines follows this list)

    • models included:
      • KNeighborsClassifier(n_neighbors=k_value)
      • tree.DecisionTreeClassifier()
      • RandomForestClassifier()
      • ExtraTreesClassifier(random_state=1)
      • GradientBoostingClassifier(random_state=1)
      • AdaBoostClassifier(random_state=1)
      • LogisticRegression()
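A condensed sketch of both pipelines together: the binary target with StandardScaler and RandomUnderSampler (dataset 4 above) fed to a few of the models, with metrics collected into a DataFrame. The column names, split details, and report path are illustrative, not the project's exact code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from imblearn.under_sampling import RandomUnderSampler

# --- data preparation pipeline: split, scale, and undersample the training set ---
X = df.drop(columns=["diabetes", "diabetes_binary"])
y = df["diabetes_binary"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train, y_train = RandomUnderSampler(random_state=1).fit_resample(X_train, y_train)

# --- model execution pipeline: fit each model and collect its metrics ---
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=1),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=1),
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    })

metrics = pd.DataFrame(results)
print(metrics)
metrics.to_csv("reports/model_metrics.csv", index=False)  # hypothetical report path
```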

Evaluating Overfitting

All the models were greatly overfit with the base dataset.

All models were then run against each modified dataset. The metrics were prepared and archived in the brfss_2021/reports/ directory.

It was determined that overfitting occurred in most cases. However, it was minimized by using the binary target feature, scaling with StandardScaler or MinMaxScaler, and resampling with RandomOverSampler or RandomUnderSampler.
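One simple way to check for overfitting, reusing the fitted models dictionary from the sketch above, is to compare training and test accuracy; a large gap suggests the model has memorized the training data:

```python
# Compare train vs. test accuracy for each fitted model.
for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```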

Imbalanced data

  • Value count % of the base data:

    target value   %      description
    0              84%    No diabetes
    1              2%     Pre-diabetes
    2              14%    Diabetes

  • Value count % of the binary data:

    target value   %      description
    0              86%    No diabetes
    1              14%    Diabetes

Using the following sampling methods improved the metric results:

  • RandomOverSampler
  • RandomUnderSampler
  • ClusterCentroids
  • SMOTE
  • SMOTEENN

RandomOverSampler & RandomUnderSampler performed as well as the others and had a better execution time. RandomUnderSampler provided the smallest dataset to train and fit. Therefore it was used in the optimization phase.

Metric Evaluation

The metrics for all the modified datasets and models are provided in the reports directory. The performance summary shows the performance of all model executions. The detailed reports are listed below:

Initial results of the 7 models × 9 permutations of the data (datasets), for 63 total runs.

  • We sorted the top 20 accuracy results, top 20 precision results, and top 20 F1 scores. We then performed an inner join on the results (as sketched below); the models that performed best across all three lists are:
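A sketch of that ranking step, assuming the metrics DataFrame produced by the model execution pipeline has one row per model/dataset run with model, dataset, accuracy, precision, and f1 columns (column names assumed):

```python
# Top-20 runs by each metric, keeping just the run identifiers.
top_accuracy  = metrics.nlargest(20, "accuracy")[["model", "dataset"]]
top_precision = metrics.nlargest(20, "precision")[["model", "dataset"]]
top_f1        = metrics.nlargest(20, "f1")[["model", "dataset"]]

# Inner joins keep only the runs that appear in all three top-20 lists.
best = (top_accuracy
        .merge(top_precision, on=["model", "dataset"])
        .merge(top_f1, on=["model", "dataset"]))
print(best)
```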

2021 Best Models

Addressing imbalanced data with RandomOverSampler or SMOTE worked best with the RandomForestClassifier, ExtraTreesClassifier, and GradientBoostingClassifier models.

However, the top models based on accuracy are GradientBoostingClassifier, AdaBoostClassifier, and LogisticRegression with the binary data (0/1: no diabetes/diabetes) and StandardScaler applied to the dataset.

Optimization / Hyperparameter tuning

Hyperparameter Tuning

  1. Decision Tree Classifier + Randomized Search CV:
    • We sampled a fixed number of parameter settings from specified ranges for efficiency (see the sketch after this list)

    • The optimization helped, but not by a substantial amount on this dataset

    • We sorted by the highest F1 score, precision, and accuracy

    • The results were these 4 data sets:

    • The final parameters and scores reflect the optimized model's ability to predict diabetes with higher accuracy and reliability.
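A sketch of how RandomizedSearchCV can sample a fixed number of parameter settings for a DecisionTreeClassifier; the parameter ranges and n_iter below are illustrative, not the project's tuned values:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Parameter distributions to sample from (illustrative ranges).
param_dist = {
    "max_depth": randint(2, 20),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 20),
}

# n_iter controls how many parameter settings are sampled and evaluated.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=30,
    scoring="f1",
    cv=5,
    random_state=1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```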

Conclusions - Project Goal Achievement?

Conclusion:

  • Conclusions from 63 Model/Dataset Runs for each year (126 total dataset/model combinations)
  • We achieved good accuracy, but because of the class imbalance we struggled with precision. Optimization helped somewhat, but did not produce large gains for most models.

Top Models based on accuracy

  • GradientBoostingClassifier
  • AdaBoostClassifier
  • LogisticRegression

Top Datasets

  • Binary dataset with StandardScaler
  • Binary dataset with StandardScaler & SMOTEENN sampling

Project Goal: Achieved

  • Successfully identified key factors contributing to diabetes prevalence.
  • Developed predictive models with significant accuracy and reliability.
  • Achieved strong predictive performance through the application of pipelines, optimized datasets, advanced classification models, model performance ranking, and model optimization.





Files and Directories

Data cleaning

Initial data exploration and cleaning work. These two directories relate to the initial evaluation of all the BRFSS datasets from 2019 to 2022, reading the codebooks, and evaluating and selecting features. The results were applied to the data cleaning work described above.

  • data: Data pulled for the initial feature analysis and data cleaning research
  • data_cleaning: Notebooks and files for the initial feature analysis and data cleaning research

Analysis

Two full analyses were run: one for the BRFSS 2015 dataset and the other for the BRFSS 2021 dataset. These analyses are self-contained in the following directories:

  • brfss_2015 2015 analysis and optimizations
  • brfss_2021 2021 analysis and optimizations

Documentation

  • docs Other descriptive markdown documents
  • imgs Images to support this readme
