# BRFSS-regression.py
# -*- coding: utf-8 -*-
"""Complete-your-supervised-learning-project.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1XQ6gqcGvQuSRfWwHkPy35aDJfaw03Ny2
**Welcome to my experiment on the Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data dataset!**
In this exploration, we identify features that are useful for building regression models that make relevant, accurate predictions on this data. Background information on the BRFSS is provided below, including links that may help in understanding the individual facets of the data being analyzed.
**Our Question Today:** How impactful are non-medical factors in this study? In particular, can we use *statistical* factors to extend the inferences made from this survey?
**Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data (2011 to present)**
BRFSS combines landline and cell phone prevalence data. BRFSS is a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. Data are updated annually as they become available.
Detailed information on sampling methodology and quality assurance can be found on the BRFSS website (http://www.cdc.gov/brfss).
*Methodology:* http://www.cdc.gov/brfss/factsheets/pdf/DBS_BRFSS_survey.pdf
*Glossary:* https://chronicdata.cdc.gov/Behavioral-Risk-Factors/Behavioral-Risk-Factor-Surveillance-System-BRFSS-H/iuq5-y9ct/data
**Source:** https://catalog.data.gov/dataset/behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2011-to-present
"""
# Commented out IPython magic to ensure Python compatibility.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
# %matplotlib inline
import warnings
warnings.filterwarnings('ignore')
BRFSS_df = pd.read_csv("https://data.cdc.gov/api/views/dttw-5yxu/rows.csv?accessType=DOWNLOAD")
"""**Exploratory Data Analysis**
---
"""
BRFSS_df.info()
BRFSS_df.head(20)
non_numeric_columns = BRFSS_df.select_dtypes(['object']).columns
print(non_numeric_columns)
print("The number of non-numerical columns is {}".format(len(non_numeric_columns)))
numeric_columns = BRFSS_df.select_dtypes(['int64', 'float64']).columns
print(numeric_columns)
print("The number of numerical columns is {}".format(len(numeric_columns)))
total_missing = BRFSS_df.isnull().sum().sort_values(ascending=False)
percent_missing = (BRFSS_df.isnull().sum() / len(BRFSS_df)).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent_missing], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
plt.figure(figsize=(25,5))
plt.hist(BRFSS_df.Topic, edgecolor='black', linewidth=1.0)
plt.title("The distribution of data by Topic")
plt.xlabel("Topic")
plt.ylabel("number of occurrences per Topic")
plt.xticks(rotation=45, ha='right')
plt.show()
# Question & Response data coordinated:
## The idea here was to create a visualization of the Question and Response data together
## on the same chart(s). Unfortunately, I was unable to effectively show the frequency of
## each response value for each question, so the rest of this code has been excluded.
#plt.figure(figsize=(30,50))
#for index, column in enumerate(non_numeric_columns):
# plt.subplot(11,4,index+1)
# plt.bar(BRFSS_df.groupby(column)["Response"], BRFSS_df.groupby(column)["Response"].count(), color=("grey","green"))
# plt.title("Response wrt. {}".format(column))
# plt.ylabel("Frequency of Response")
# plt.xlabel(column)
# plt.xticks(rotation='vertical')
#plt.tight_layout()
#plt.show()
# ____________________________________________________________________ #
plt.figure(figsize=(30,40))
plt.scatter(BRFSS_df.Question, BRFSS_df.Response)
plt.title("The distribution of Question Responses")
plt.xlabel("Question")
plt.ylabel("Response")
plt.xticks(rotation=45, ha='right')
plt.grid()
## zip joins x and y coordinates in pairs
#for x,y in zip(BRFSS_df.Question, BRFSS_df.Response):
# label = f"({y.count(y)})"
# plt.annotate(label, # this is the text
# (x,y), # these are the coordinates to position the label
# textcoords="offset points", # how to position the text
# xytext=(0,10), # distance from text to points (x,y)
# ha='center') # horizontal alignment can be left, right or center
plt.show()
BRFSS_df.describe()
np.abs(BRFSS_df[numeric_columns].corr().loc[:,"Data_value"]).sort_values(ascending=False)
# deriving correlation coefficients between the features and the target variable
BRFSS_df['Data_value'].head(25)
cleaned_BRFSS_df = BRFSS_df.fillna(0)
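```python
# An illustrative alternative to zero-filling (a sketch, not part of the
# original analysis; the helper name is hypothetical): impute numeric columns
# with their medians so that artificial zeros do not enter the target or the
# features, leaving non-numeric columns untouched.
import pandas as pd

def impute_numeric_median(df):
    """Fill missing values in numeric columns with each column's median."""
    out = df.copy()
    num_cols = out.select_dtypes(include=["number"]).columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out
# e.g. impute_numeric_median(BRFSS_df) could stand in for the fillna(0) call above.
```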
"""**Ordinary Least Squares (OLS) Regression**
---
"""
# Ordinary Least Squares (OLS) Regression:
# NOTE: numeric_columns still contains the target "Data_value" itself, so the
# target leaks into the predictors; this largely explains the near-perfect
# R-squared values reported below.
X = cleaned_BRFSS_df[numeric_columns]
Y = cleaned_BRFSS_df.Data_value
X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()
results.summary()
# Making predictions
y_preds = results.predict(X)
plt.scatter(Y, y_preds)
plt.plot(Y, Y, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("Data_value: true and predicted values")
plt.show()
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(Y, y_preds)))
print("Mean squared error of the prediction is: {}".format(mse(Y, y_preds)))
print("Root mean squared error of the prediction is: {}".format(rmse(Y, y_preds)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((Y - y_preds) / Y)) * 100))
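```python
# MAPE divides by the true values, so the zeros introduced by fillna(0) (and
# any genuine zeros in Data_value) make it infinite here. A sketch (not in the
# original analysis) of a variant that scores only the nonzero targets:
import numpy as np

def mape_nonzero(y_true, y_pred):
    """Mean absolute percentage error over nonzero true values only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100
# e.g. mape_nonzero(Y, y_preds) stays finite whenever any target is nonzero.
```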
# Split the data into train and test sets (20% in test):
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
train_test_results = sm.OLS(y_train, X_train).fit()
train_test_results.summary()
# Making predictions
y_preds = train_test_results.predict(X_test)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("Data_value: true and predicted values")
plt.show()
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))
"""**Employing a Diverse Portfolio of Linear Regression Models**
---
"""
# Building several linear regression models (lasso, ridge, elastic net)
# and training them in the training set:
alphas = np.logspace(-10, 39, 50)  # 10**-10 through 10**39, one per integer exponent
lrm = LinearRegression()
lrm.fit(X_train, y_train)
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)
print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Lasso:
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train, y_train)
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)
print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Ridge:
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)
print("Best alpha value is: {}".format(ridge_cv.alpha_))
print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Elastic Net:
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)
elasticnet_cv.fit(X_train, y_train)
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)
print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
"""**Results: Evaluating the Models**
---
After analyzing the data with a variety of regression models, the following conclusions appear reasonable:
* It is difficult to name a single best-performing model, given a couple of contradictory outcomes.
* In particular, all of the models yield an R-squared extremely close to 1; in fact, half of them yield exactly 1.
* Although an R-squared near 1 is normally ideal, every model in this notebook also yielded an *infinite* Mean Absolute Percentage Error, because the zero-filled true values appear in the MAPE denominator. The high R-squared values therefore cannot be accepted as valid indicators of well-fitting regression models.
* The near-perfect R-squared values are also largely explained by the target column Data_value itself being included among the numeric predictors, so each model could recover the target directly from the features.
* However, the Ordinary Least Squares regression yielded extremely low (and even *negative*) AIC and BIC scores both before *and* after splitting the data. As a result, that method is likely the best-performing option among the models chosen for this experiment.
* Specifically, the OLS model fit *before* the train/test split is likely the best performer, as it exhibits lower AIC and BIC scores than the same method produced *after* splitting the data.
* Additionally, the data-cleaning choice (filling missing and null values with zero) may not have been the most effective way to prepare the data for analysis. Consequently, the resulting models may be overfit and insufficiently regularized.
Thank you very much for your time and attention throughout this process!
"""