HAI-Infections

Linear Regression Analysis on Nosocomial Infections using an extract from the Study on the Efficacy of Nosocomial Infection Control (SENIC)

This was an assignment in my Statistical Computer Packages course at the George Washington University. Through this project I learned a lot about feature selection using wrapper methods and machine learning techniques in regression.

Credit to Vikashraj Luhaniwal of TowardsDataScience.com for his article Feature Selection Using Wrapper Methods, the source of the feature selection code I use in this notebook. Credit also to Dr. Tirthajyoti Sarkar of TowardsDataScience.com for his article How to Check the Quality of a Regression Model with Python and for the code from his GitHub that I use for the residual analysis in this notebook. Links to these articles are posted below:

Code and Resources Used

Python Version: 3.6

Packages: pandas, numpy, matplotlib, seaborn, statsmodels

Feature Selection Using Wrapper Methods by Vikashraj Luhaniwal: https://towardsdatascience.com/feature-selection-using-wrapper-methods-in-python-f0d352b346f

How to Check the Quality of a Regression Model with Python by Dr. Tirthajyoti Sarkar: https://towardsdatascience.com/how-do-you-check-the-quality-of-your-regression-model-in-python-fa61759ff685

Data

The data is an extract from the Study on the Efficacy of Nosocomial Infection Control (SENIC). The variables are the following:

length of stay
age
infection risk
routine culturing
routine chest x-ray
num of beds
med school affiliation
avg daily census
num of nurses
available facilities & services

Data Cleaning

Patient IDs 1-5 serve as our test data, so these rows are dropped from the original dataset.
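As a minimal sketch of this step, the split could look like the following. The column name `ID` and the tiny stand-in frame are assumptions for illustration; the real column names in Infections.csv may differ.

```python
import pandas as pd


def split_holdout(df, id_col="ID", holdout_ids=range(1, 6)):
    """Separate the held-out test rows (patient IDs 1-5) from the remaining rows."""
    is_test = df[id_col].isin(list(holdout_ids))
    test = df[is_test].copy()
    train = df[~is_test].reset_index(drop=True)
    return train, test


# Made-up stand-in for Infections.csv; the column names are hypothetical.
df = pd.DataFrame({
    "ID": range(1, 11),
    "infection_risk": [4.0, 2.0, 3.0, 5.5, 5.0, 4.5, 3.5, 5.0, 4.0, 6.0],
})

train, test = split_holdout(df)
print(f"{len(train)} training rows, {len(test)} test rows")
```

Keeping the held-out rows in their own frame, rather than discarding them, makes them available later for the prediction step.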

EDA

With Infection Risk as our Target Variable, I develop an understanding of the data with the following methods:

View descriptive statistics of data (means, standard deviations, etc.)
Visualize distributions and linearity via pairplot.
Visualize correlations between variables.
Further analyze variables with high correlations to target variable.
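The steps above can be sketched roughly as follows. This is an illustration on synthetic data, not the notebook's exact code; the column names and the helper names `summarize` and `plot_overview` are assumptions.

```python
import numpy as np
import pandas as pd


def summarize(df, target):
    """Return descriptive statistics and correlations with the target, strongest first."""
    stats = df.describe()
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return stats, corr.reindex(order)


def plot_overview(df):
    """Draw a pairplot and a correlation heatmap (plotting libraries imported on use)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.pairplot(df)  # distributions on the diagonal, pairwise scatter plots elsewhere
    plt.figure()
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
    plt.show()


# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 100
stay = rng.normal(9.6, 1.9, n)
df = pd.DataFrame({
    "length_of_stay": stay,
    "age": rng.normal(53.0, 4.5, n),
    "infection_risk": 0.4 * stay + rng.normal(0, 0.5, n),
})

stats, corr_with_target = summarize(df, "infection_risk")
print(stats.loc[["mean", "std"]])
print(corr_with_target)
```

Calling `plot_overview(df)` would produce the two figures; it is kept separate so the numeric summary runs without opening any plot windows.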

Descriptive Statistics

Correlations

Seeing that length of stay, routine culturing, routine chest x-ray, and available facilities & services have the highest correlation with infection risk, I then plotted their respective scatter plots against infection risk.

Sample Scatter Plots

Model Building

To build a Linear Regression Model with the remaining (training) data, I used these three feature selection methods:

Forward Selection
Backward Elimination
Stepwise Selection

For the most part, all three methods selected "length of stay", "routine culturing", and "available facilities & services" as the top three features.

Sample Feature Selection Output

Model Analysis: Residual Analysis & Verifying Assumptions

To ensure we can trust our model, I had to verify the following assumptions about Linear Models:

Independence of predictors
Linearity with Target Variable
Homoscedasticity
Normally Distributed Residuals
No Multicollinearity

The Jupyter Notebook explores these checks in more detail. Overall, every assumption is satisfied except No Multicollinearity: the predictors are multicollinear.

Linearity & Independence

Homoscedasticity

Multicollinearity

Note that there is Multicollinearity since length of stay and available facilities & services have VIFs > 10.

Prediction Results

Our model, without further optimization, predicts the following:

Test Data

Predictions of Test Data

Confidence and Prediction Intervals

Optimization Ideas

With an adjusted R-squared of .471, the model explains less than half of the variance in infection risk, so it needs further optimization to become more accurate and useful. I recommend the following:

Sample data that includes a score for the quality of sanitation at each healthcare facility, using criteria such as hand washing, presence of rodents, preventive measures against germ spread, use of gloves, etc.
Record, or engineer from existing data, the average number of patients per room in a healthcare facility.
Delete the few outliers from the dataset.
