# BRFSS-regression.py
# -*- coding: utf-8 -*-
"""Complete-your-supervised-learning-project.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1XQ6gqcGvQuSRfWwHkPy35aDJfaw03Ny2
**Welcome to my experiment on the Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data dataset!**
In this exploration, we identify features that are useful for building regression models that make relevant, accurate predictions on this data. Background information on the BRFSS is provided below, including links that may help in understanding the individual facets of the data being analyzed.
**Our Question Today:** How impactful are non-medical factors in this study? In particular, can we use *statistical* factors to extend the inferences made from this survey?
**Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data (2011 to present)**
BRFSS combines landline and cell phone prevalence data. BRFSS is a continuous, state-based surveillance system that collects information about modifiable risk factors for chronic diseases and other leading causes of death. Data are updated annually as they become available.
Detailed information on sampling methodology and quality assurance can be found on the BRFSS website (http://www.cdc.gov/brfss).
*Methodology:* http://www.cdc.gov/brfss/factsheets/pdf/DBS_BRFSS_survey.pdf
*Glossary:* https://chronicdata.cdc.gov/Behavioral-Risk-Factors/Behavioral-Risk-Factor-Surveillance-System-BRFSS-H/iuq5-y9ct/data
**Source:** https://catalog.data.gov/dataset/behavioral-risk-factor-surveillance-system-brfss-prevalence-data-2011-to-present
"""
# Commented out IPython magic to ensure Python compatibility.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
# %matplotlib inline
import warnings
warnings.filterwarnings('ignore')
BRFSS_df = pd.read_csv("https://data.cdc.gov/api/views/dttw-5yxu/rows.csv?accessType=DOWNLOAD")
"""**Exploratory Data Analysis**
---
"""
BRFSS_df.info()
BRFSS_df.head(20)
non_numeric_columns = BRFSS_df.select_dtypes(['object']).columns
print(non_numeric_columns)
print("The number of non-numerical columns is {}".format(len(non_numeric_columns)))
numeric_columns = BRFSS_df.select_dtypes(['int64', 'float64']).columns
print(numeric_columns)
print("The number of numerical columns is {}".format(len(numeric_columns)))
total_missing = BRFSS_df.isnull().sum().sort_values(ascending=False)
percent_missing = (BRFSS_df.isnull().sum() / len(BRFSS_df)).sort_values(ascending=False)
missing_data = pd.concat([total_missing, percent_missing], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
plt.figure(figsize=(25,5))
plt.hist(BRFSS_df.Topic, edgecolor='black', linewidth=1.0)
plt.title("The distribution of data by Topic")
plt.xlabel("Topic")
plt.ylabel("number of occurrences per Topic")
plt.xticks(rotation=45, ha='right')
plt.show()
# Question & Response data coordinated:
## The idea here was to create a visualization of the Question and Response data together
## on the same chart(s). Unfortunately, I was unable to effectively show the frequency of
## each response value for each question, so the rest of this code has been excluded.
#plt.figure(figsize=(30,50))
#for index, column in enumerate(non_numeric_columns):
# plt.subplot(11,4,index+1)
# plt.bar(BRFSS_df.groupby(column)["Response"], BRFSS_df.groupby(column)["Response"].count(), color=("grey","green"))
# plt.title("Response wrt. {}".format(column))
# plt.ylabel("Frequency of Response")
# plt.xlabel(column)
# plt.xticks(rotation='vertical')
#plt.tight_layout()
#plt.show()
# ____________________________________________________________________ #
plt.figure(figsize=(30,40))
plt.scatter(BRFSS_df.Question, BRFSS_df.Response)
plt.title("The distribution of Question Responses")
plt.xlabel("Question")
plt.ylabel("Response")
plt.xticks(rotation=45, ha='right')
plt.grid()
## zip joins x and y coordinates in pairs
#for x,y in zip(BRFSS_df.Question, BRFSS_df.Response):
# label = f"({y.count(y)})"
# plt.annotate(label, # this is the text
# (x,y), # these are the coordinates to position the label
# textcoords="offset points", # how to position the text
# xytext=(0,10), # distance from text to points (x,y)
# ha='center') # horizontal alignment can be left, right or center
plt.show()
BRFSS_df.describe()
np.abs(BRFSS_df[numeric_columns].corr().loc[:,"Data_value"]).sort_values(ascending=False)
# deriving correlation coefficients between the features and the target variable
BRFSS_df['Data_value'].head(25)
cleaned_BRFSS_df = BRFSS_df.fillna(0)
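```python
# An illustrative alternative to zero-filling (a sketch, not part of the
# original analysis; the helper name is hypothetical): impute numeric columns
# with their medians so that artificial zeros do not enter the target or the
# features, leaving non-numeric columns untouched.
import pandas as pd

def impute_numeric_median(df):
    """Fill missing values in numeric columns with each column's median."""
    out = df.copy()
    num_cols = out.select_dtypes(include=["number"]).columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out
# e.g. impute_numeric_median(BRFSS_df) could stand in for the fillna(0) call above.
```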
"""**Ordinary Least Squares (OLS) Regression**
---
"""
# Ordinary Least Squares (OLS) Regression:
# NOTE: numeric_columns still contains the target "Data_value" itself, so the
# target leaks into the predictors; this largely explains the near-perfect
# R-squared values reported below.
X = cleaned_BRFSS_df[numeric_columns]
Y = cleaned_BRFSS_df.Data_value
X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()
results.summary()
# Making predictions
y_preds = results.predict(X)
plt.scatter(Y, y_preds)
plt.plot(Y, Y, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("Data_value: true and predicted values")
plt.show()
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(Y, y_preds)))
print("Mean squared error of the prediction is: {}".format(mse(Y, y_preds)))
print("Root mean squared error of the prediction is: {}".format(rmse(Y, y_preds)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((Y - y_preds) / Y)) * 100))
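```python
# MAPE divides by the true values, so the zeros introduced by fillna(0) (and
# any genuine zeros in Data_value) make it infinite here. A sketch (not in the
# original analysis) of a variant that scores only the nonzero targets:
import numpy as np

def mape_nonzero(y_true, y_pred):
    """Mean absolute percentage error over nonzero true values only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100
# e.g. mape_nonzero(Y, y_preds) stays finite whenever any target is nonzero.
```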
# Split the data into train and test sets (20% in test):
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)
train_test_results = sm.OLS(y_train, X_train).fit()
train_test_results.summary()
# Making predictions
y_preds = train_test_results.predict(X_test)
plt.scatter(y_test, y_preds)
plt.plot(y_test, y_test, color="red")
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.title("Data_value: true and predicted values")
plt.show()
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds) / y_test)) * 100))
"""**Employing a Diverse Portfolio of Linear Regression Models**
---
"""
# Building several linear regression models (lasso, ridge, elastic net)
# and training them in the training set:
alphas = np.logspace(-10, 39, 50)  # 10**-10 through 10**39, one per integer exponent
lrm = LinearRegression()
lrm.fit(X_train, y_train)
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)
print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Lasso:
lasso_cv = LassoCV(alphas=alphas, cv=5)
lasso_cv.fit(X_train, y_train)
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)
print("Best alpha value is: {}".format(lasso_cv.alpha_))
print("R-squared of the model in training set is: {}".format(lasso_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Ridge:
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)
print("Best alpha value is: {}".format(ridge_cv.alpha_))
print("R-squared of the model in training set is: {}".format(ridge_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
# Elastic Net:
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)
elasticnet_cv.fit(X_train, y_train)
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)
print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print("R-squared of the model in training set is: {}".format(elasticnet_cv.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticnet_cv.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))
"""**Results: Evaluating the Models**
---
After analyzing the data with a variety of regression models, the following conclusions appear reasonable:
* It is difficult to name a single best-performing model, given a couple of contradictory outcomes.
* In particular, all of the models yield an R-squared extremely close to 1; in fact, half of them yield exactly 1.
* Although an R-squared near 1 is normally ideal, every model in this notebook also yielded an *infinite* Mean Absolute Percentage Error, because the zero-filled true values appear in the MAPE denominator. The high R-squared values therefore cannot be accepted as valid indicators of well-fitting regression models.
* The near-perfect R-squared values are also largely explained by the target column Data_value itself being included among the numeric predictors, so each model could recover the target directly from the features.
* However, the Ordinary Least Squares regression yielded extremely low (and even *negative*) AIC and BIC scores both before *and* after splitting the data. As a result, that method is likely the best-performing option among the models chosen for this experiment.
* Specifically, the OLS model fit *before* the train/test split is likely the best performer, as it exhibits lower AIC and BIC scores than the same method produced *after* splitting the data.
* Additionally, the data-cleaning choice (filling missing and null values with zero) may not have been the most effective way to prepare the data for analysis. Consequently, the resulting models may be overfit and insufficiently regularized.
Thank you very much for your time and attention throughout this process!
"""