Stroke is the second leading cause of death globally. This repository contains the code and documentation for an educational research project on stroke risk prediction. The project investigates the escalating incidence of strokes using a Kaggle-sourced dataset with 11 clinical features.
The workflow covers data preprocessing, exploratory data analysis, and the development of logistic regression and XGBoost models.
- Exploratory Data Analysis (EDA): The EDA process involves visualization tools such as R and ggplot2 to analyze stroke events across demographic and lifestyle factors. Unexpected findings, such as the relationship between glucose levels and strokes, are explored.
- Imbalanced Data Handling: Acknowledging imbalanced data, the project addresses this bias through oversampling using the MWMOTE technique.
- Logistic Regression and Variable Importance: Logistic regression models are developed, and Variable Importance in Projection (VIP) analysis is employed to identify key factors influencing stroke occurrence.
- AIC Analysis: The project compares two logistic regression models using the Akaike Information Criterion (AIC) for model selection.
- Performance Assessment: Performance evaluation includes confusion matrices, ROC curves, and other metrics for both logistic regression and XGBoost models.
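As a rough illustration of the AIC comparison listed above, the sketch below fits two logistic regression models on simulated data and compares them with `AIC()`. The data frame `toy`, its columns (`age`, `glucose`, `noise`, `stroke`), and the coefficients used to simulate the outcome are made up for illustration; they are not the project's dataset or model formulas.

```r
# Simulated data; column names and effect sizes are illustrative assumptions.
set.seed(42)
n <- 500
toy <- data.frame(
  age     = rnorm(n, mean = 50, sd = 15),
  glucose = rnorm(n, mean = 105, sd = 30),
  noise   = rnorm(n)
)
# The simulated outcome depends on age and glucose only.
lin_pred   <- -6 + 0.06 * toy$age + 0.02 * toy$glucose
toy$stroke <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Two candidate models: a smaller one and one with an extra predictor.
model_small <- glm(stroke ~ age + glucose, family = binomial, data = toy)
model_large <- glm(stroke ~ age + glucose + noise, family = binomial, data = toy)

# AIC = 2k - 2 * logLik, where k is the number of parameters.
# The model with the lower AIC is preferred.
aic_table <- AIC(model_small, model_large)
print(aic_table)
best_model <- rownames(aic_table)[which.min(aic_table$AIC)]
```

Because AIC penalizes each extra parameter, a predictor is only worth keeping if it improves the log-likelihood by more than one unit.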
Data Imbalance
To address class imbalance in the dataset, the Majority Weighted Minority Oversampling Technique (MWMOTE) was used. This technique oversamples the minority class, focusing on minority instances that are hard to learn. It assigns weights to these instances based on their distance to the nearest majority class instances, and the weights guide the generation of synthetic samples that balance the class distribution. This makes the model more capable of learning challenging instances.
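To make the weighting idea concrete, here is a deliberately simplified base R sketch. It is not MWMOTE itself (which also filters noisy points and clusters the minority class) and it is not the project's code: it only weights each minority point by the inverse of its distance to the nearest majority point and interpolates new points from the weighted sample. The function name `weighted_oversample` and the toy data are hypothetical.

```r
# Simplified, hypothetical sketch of distance-weighted oversampling.
# Not the full MWMOTE algorithm: no noise filtering and no clustering.
weighted_oversample <- function(minority, majority, n_new) {
  minority <- as.matrix(minority)
  majority <- as.matrix(majority)

  # Distance from every minority point to its nearest majority point.
  nearest_majority <- apply(minority, 1, function(p) {
    min(sqrt(colSums((t(majority) - p)^2)))
  })

  # Minority points closer to the majority class get larger weights.
  weights <- 1 / (nearest_majority + 1e-8)
  weights <- weights / sum(weights)

  # Pick a weighted seed and a random minority partner, then interpolate
  # a new point on the segment between them.
  seeds    <- sample(nrow(minority), n_new, replace = TRUE, prob = weights)
  partners <- sample(nrow(minority), n_new, replace = TRUE)
  alpha    <- runif(n_new)
  minority[seeds, , drop = FALSE] +
    alpha * (minority[partners, , drop = FALSE] - minority[seeds, , drop = FALSE])
}

# Toy data: 100 majority points and 10 minority points in two dimensions.
set.seed(1)
majority  <- matrix(rnorm(200, mean = 0), ncol = 2)
minority  <- matrix(rnorm(20, mean = 2), ncol = 2)
synthetic <- weighted_oversample(minority, majority,
                                 n_new = nrow(majority) - nrow(minority))
balanced_minority <- rbind(minority, synthetic)
```

After this step the minority class has as many rows as the majority class, and the synthetic points are concentrated near the class boundary.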
- The Logistic Regression Model was created by using the ‘glm’ method and ‘binomial’ family.
- The other model is XGBoost, a gradient-boosted model from the tree-based family.
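A minimal sketch of fitting a logistic regression with `glm` and the `binomial` family, then turning predicted probabilities into class labels. The simulated data frame `toy`, its columns, and the 0.5 cutoff are assumptions for illustration; the project's actual formula and settings are not reproduced here.

```r
# Simulated data; column names and coefficients are illustrative assumptions.
set.seed(7)
n <- 400
toy <- data.frame(
  age          = rnorm(n, mean = 50, sd = 15),
  hypertension = rbinom(n, size = 1, prob = 0.3)
)
toy$stroke <- rbinom(n, size = 1,
                     prob = plogis(-4 + 0.06 * toy$age + 0.8 * toy$hypertension))

# Logistic regression: binomial family with the default logit link.
fit <- glm(stroke ~ age + hypertension, family = binomial, data = toy)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(fit, newdata = toy, type = "response")
pred <- ifelse(prob >= 0.5, 1, 0)  # assumed cutoff of 0.5

# Exponentiated coefficients can be read as odds ratios.
odds_ratios <- exp(coef(fit))
```

The cutoff trades sensitivity against specificity, so in practice it would be chosen with the ROC curve in mind.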
Metric | Logistic Regression Model | XGBoost Model |
---|---|---|
Accuracy | 81.13% | 96.15% |
Sensitivity (True Positive Rate) | 85.12% | 94.97% |
Specificity (True Negative Rate) | 77.15% | 97.33% |
Kappa Statistic | 0.6226 | 0.923 |
Positive Predictive Value (PPV) | 78.83% | 97.27% |
Negative Predictive Value (NPV) | 83.83% | 95.09% |
Balanced Accuracy | 81.13% | 96.15% |
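For reference, the metrics in the table can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas in base R; the function name `classification_metrics` and the counts passed to it are made up for illustration and are not the project's results.

```r
# Compute the table's metrics from the four confusion matrix counts.
# The positive class is assumed to be "stroke".
classification_metrics <- function(tp, fn, fp, tn) {
  n <- tp + fn + fp + tn
  accuracy    <- (tp + tn) / n
  sensitivity <- tp / (tp + fn)  # true positive rate
  specificity <- tn / (tn + fp)  # true negative rate
  ppv         <- tp / (tp + fp)  # positive predictive value
  npv         <- tn / (tn + fn)  # negative predictive value
  # Cohen's kappa: observed agreement corrected for chance agreement.
  expected <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2
  kappa    <- (accuracy - expected) / (1 - expected)
  c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
    kappa = kappa, ppv = ppv, npv = npv,
    balanced_accuracy = (sensitivity + specificity) / 2)
}

# Made-up counts, not the project's results.
m <- classification_metrics(tp = 80, fn = 20, fp = 10, tn = 90)
print(round(m, 4))
```

When both classes have the same number of instances, balanced accuracy equals accuracy and kappa reduces to `2 * accuracy - 1`, which matches the pattern in the table above.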
- src/: Contains the script for data preprocessing, model development, and evaluation.
- data/: Contains the Kaggle-sourced dataset.
- plots/: Includes visualizations generated during the exploratory data analysis.
- report/: Contains the presentation and the scientific paper of the project.
- RStudio
- R version 4.3.2
- Run this command in the RStudio console to install the required libraries:
source("./src/install_libraries.R")
- Clone the repository:
git clone https://github.com/Kaito999/stroke-risk-prediction.git
- Navigate to the project directory:
cd stroke-risk-prediction
- Open the project:
stroke-risk-prediction/stroke-risk-prediction.Rproj
- Access the code:
src/stroke_risk_prediction.R
- Run the code in the script step by step.
This project is licensed under the MIT License; see the LICENSE file for details.