In this collaborative coding project, we aim to develop accurate stroke prediction models using machine learning techniques. Our dataset encompasses various essential features, including age, gender, BMI, average glucose level, work type, and smoking status. To achieve reliable predictions, we perform data preprocessing, outlier detection, feature selection, and model training. Through this project, we showcase the practical application of machine learning techniques in stroke prediction, providing valuable insights for early detection and prevention. Our ultimate goal is to accurately identify individuals at risk, contributing to improved healthcare outcomes.
First you need to clone the repository:
git clone https://github.com/ain2002-project/ain2002-project
cd ain2002-project
The code for this project was developed and tested on Python 3.7.12. We have added a .python-version file to the repository to ensure that the correct version of Python is used. We recommend using pyenv to manage your Python versions. If Python 3.7.12 is already installed on your system, you can skip the following step.
Install Python 3.7.12 using pyenv:
pyenv install 3.7.12
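To confirm that the active interpreter matches the pinned version, a small check along these lines may help. This is a minimal sketch, not part of the repository: the helper name matches_pinned_version is hypothetical, and it assumes .python-version holds a single version string such as 3.7.12.

```python
import sys
from pathlib import Path


def matches_pinned_version(pinned, version_info=sys.version_info):
    """Return True if the interpreter version equals the pinned 'X.Y.Z' string."""
    current = ".".join(str(part) for part in version_info[:3])
    return current == pinned.strip()


if __name__ == "__main__":
    # Assumption: the script is run from the repository root.
    version_file = Path(".python-version")
    if version_file.exists():
        pinned = version_file.read_text()
        status = "matches" if matches_pinned_version(pinned) else "differs from"
        print(f"Running Python {sys.version.split()[0]}, which {status} the pinned {pinned.strip()}")
    else:
        print(".python-version not found; run this from the repository root")
```

Run it with the same python command you intend to use for the project, so that the check reflects the interpreter pyenv actually selects.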
We should create a virtual environment so that the packages installed for this project do not interfere with the packages installed in the system. To create a virtual environment, run the following command in the root directory of the repository:
python -m venv .venv
source .venv/bin/activate
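If you are unsure whether the environment is active, the interpreter itself can tell you. The following is a minimal sketch; the function name in_virtualenv is hypothetical. It relies on the fact that environments created with python -m venv set sys.prefix to the environment directory while sys.base_prefix keeps pointing at the base installation.

```python
import sys


def in_virtualenv(prefix=sys.prefix, base_prefix=sys.base_prefix):
    """Return True when the interpreter runs inside a venv-style environment."""
    # Outside a virtual environment both prefixes are identical.
    return prefix != base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; packages would be installed system-wide")
```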
To install the required packages, run the following command in the root directory of the repository:
pip install -r requirements.txt
You can also run the notebook on Kaggle. The notebook is available here.
To run the code on Kaggle, add the playground-series-s3e2 competition dataset by Kaggle and the stroke-prediction-dataset by fedesoriano in the Data section of the right panel. We have already added these datasets to our Kaggle notebook.
If you want to download the data locally, you can download the datasets from Kaggle manually or use the following commands (these require you to be authenticated with Kaggle):
kaggle competitions download -c playground-series-s3e2
kaggle datasets download -d fedesoriano/stroke-prediction-dataset
And unzip them:
unzip stroke-prediction-dataset.zip -d data
unzip playground-series-s3e2.zip -d data/playground-series-s3e2
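After unzipping, a quick check that the files landed where the code expects them can save a failed run later. The sketch below is not part of the repository. The file names in EXPECTED_FILES are assumptions about the archive contents, and the helper missing_data_files is hypothetical; adjust both if your download differs.

```python
from pathlib import Path

# Assumed archive contents; adjust if the downloaded files are named differently.
EXPECTED_FILES = [
    "healthcare-dataset-stroke-data.csv",
    "playground-series-s3e2/train.csv",
    "playground-series-s3e2/test.csv",
    "playground-series-s3e2/sample_submission.csv",
]


def missing_data_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected files that are not present under data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected data files are in place.")
```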
You can run the notebook on any Jupyter server (VS Code, JupyterLab, the jupyter notebook command, etc.). If you are using a local environment, start the notebook server by running the following command in the root directory of the repository:
jupyter notebook
You can also run the code as Python scripts. They are essentially the same as the notebook, but with less output and no plots. To train the models, run the following command in the root directory of the repository:
python train.py
This will train and save 3 models that can be used in evaluation and inference.
The evaluation script computes evaluation metrics on the validation dataset and runs inference on the competition dataset. You can run it with:
python evaluate.py
If you want to see the submission score on Kaggle, run the following command to upload the submission file:
kaggle competitions submit -c playground-series-s3e2 -f submission.csv -m "Message"
If everything goes well, you should get a private score of 0.89624.
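If the upload is rejected or the score looks off, a quick sanity check of submission.csv can help narrow things down. This is a minimal sketch under stated assumptions, not the project's own validation: it assumes the file has the columns id and stroke, as in the competition's sample submission, and that predictions are probabilities in [0, 1]. The function name check_submission is hypothetical.

```python
import csv


def check_submission(path="submission.csv"):
    """Return a list of problems found in the submission file (empty if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Assumed header, taken from the competition's sample submission.
        if reader.fieldnames != ["id", "stroke"]:
            return [f"unexpected header: {reader.fieldnames}"]
        seen_ids = set()
        # Data rows start on line 2, after the header.
        for line_number, row in enumerate(reader, start=2):
            if row["id"] in seen_ids:
                problems.append(f"line {line_number}: duplicate id {row['id']}")
            seen_ids.add(row["id"])
            try:
                probability = float(row["stroke"])
            except (TypeError, ValueError):
                problems.append(f"line {line_number}: non-numeric prediction")
                continue
            # Assumption: predictions are probabilities, so they lie in [0, 1].
            if not 0.0 <= probability <= 1.0:
                problems.append(f"line {line_number}: prediction outside [0, 1]")
    return problems


if __name__ == "__main__":
    issues = check_submission()
    print("\n".join(issues) if issues else "submission.csv looks well-formed")
```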
Pretrained models will be generated and saved in the models folder. We have also shared the models folder in a GitHub release.