This repository accompanies the manuscript "Development and Validation of Machine Learning Models to Predict Readmission after Colorectal Surgery", submitted to the Journal of GI Surgery, and contains the code needed to reproduce the work.

The colectomy and proctectomy procedure-targeted datasets were downloaded from the ACS NSQIP participant use data file website (https://www.facs.org/quality-programs/acs-nsqip/participant-use). Microsoft Excel was used to convert the TXT files to CSV files. Further data processing was then performed using the Pandas library in Python. Records with missing readmission values were dropped. A BMI column was generated from height and weight. Patients undergoing ostomy placement were identified using CPT codes (44211, 44212, 45113, 45119, 44155, 44157, 44158, 44125, 44187, 44141, 44143, 44144, 44146, 44150, 44151, 44206, 44208, 44210, 44187, 44188, 44320, 44310). The ‘COL_APPROACH’ column was condensed, with SILS, endoscopic, NOTES, ‘other MIS’, and hybrid cases recoded as laparoscopic. Procedures were categorized by CPT code as left, right, or total colectomy, low anterior resection (LAR), abdominoperineal resection (APR), or proctectomy with perineal approach. A race/ethnicity column was generated by combining the race and ethnicity columns. Missing categorical values were filled with the string ‘Unknown’. Missing numerical values were filled with the median value of the column. The numerical columns were scaled using RobustScaler. The categorical columns were encoded using LabelEncoder.

RandomizedSearchCV was used to identify the best hyperparameters for each model. Random forest (RF) and XGBoost (XGB) combinations were tested for 100 iterations with 5-fold cross-validation on the test/validation data. Neural network (NN) combinations were tested for 50 iterations with 5-fold cross-validation. NN models consisted of a series of fully connected (Dense) layers with “relu” activation, each followed by Batch Normalization and Dropout. The Adam optimizer and binary cross-entropy loss were used.
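
The preprocessing steps described above can be sketched roughly as follows. This is an illustrative outline, not the contents of 'preproc.ipynb': the column names (`READMISSION`, `HEIGHT`, `WEIGHT`, `CPT`), the helper name `preprocess`, and the assumption that height is in inches and weight in pounds are all hypothetical and should be checked against the actual data files.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, RobustScaler

# CPT codes listed above for procedures involving ostomy placement.
OSTOMY_CPT = {
    44211, 44212, 45113, 45119, 44155, 44157, 44158, 44125, 44187, 44141,
    44143, 44144, 44146, 44150, 44151, 44206, 44208, 44210, 44188, 44320,
    44310,
}


def preprocess(df, num_cols, cat_cols, target="READMISSION"):
    """Hypothetical helper mirroring the steps described in the text."""
    # Drop records with a missing outcome.
    df = df.dropna(subset=[target]).copy()

    # BMI from height and weight (ASSUMPTION: inches and pounds).
    df["BMI"] = 703.0 * df["WEIGHT"] / df["HEIGHT"] ** 2
    num_cols = list(num_cols) + ["BMI"]

    # Flag ostomy placement from the CPT code.
    df["OSTOMY"] = df["CPT"].isin(OSTOMY_CPT).astype(int)

    # Fill missing values: 'Unknown' for categorical, median for numerical.
    df[cat_cols] = df[cat_cols].fillna("Unknown")
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # Scale numerical columns and integer-encode categorical columns.
    df[num_cols] = RobustScaler().fit_transform(df[num_cols])
    for col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df


# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "READMISSION": [1, 0, None, 0],
    "HEIGHT": [70.0, 64.0, 68.0, np.nan],
    "WEIGHT": [180.0, 150.0, 200.0, 170.0],
    "CPT": [44141, 44140, 44143, 44204],
    "COL_APPROACH": ["Laparoscopic", None, "Open", "Open"],
})
out = preprocess(demo, num_cols=["HEIGHT", "WEIGHT"], cat_cols=["COL_APPROACH"])
```

In this sketch the medians and scaler are fitted on the whole frame for brevity; whether the notebooks fit them before or after splitting the data is not stated here.
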
Hyperparameter search showed that a two-layer model with 1000 nodes per layer, followed by a single output node, with 80% dropout and a learning rate of 3 × 10⁻³, had the best performance. Training used early stopping with a patience of 25 epochs and a minimum change of 1 × 10⁻⁸. The DeLong test was implemented using code from https://biasedml.com/roc-comparison/.

The notebooks in the combine_puf folder can be used to combine the colectomy and proctectomy datasets. Once the combined CSV is created, it can be pre-processed using 'preproc.ipynb'. 'table1.ipynb' can be used to generate summary statistics. Scripts in the hyperparameter_search folder can be used to find optimal hyperparameters for each model. These parameters can then be entered into 'all_models.ipynb' to calculate metrics. This notebook also produces the true/false positive rates and precision/recall values used in 'curves.ipynb'. Finally, 'shap.ipynb' can be used to build a neural network model and perform SHAP analysis.
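
For orientation, the best-performing network configuration reported above could be expressed in Keras along these lines. This is a minimal sketch, not the code in 'all_models.ipynb' or 'shap.ipynb': the function name `build_model`, the sigmoid output activation, and monitoring `val_loss` for early stopping are assumptions that the text does not confirm.

```python
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    """Two hidden blocks of Dense(relu) -> BatchNormalization -> Dropout."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(n_features,))])
    for _ in range(2):
        model.add(tf.keras.layers.Dense(1000, activation="relu"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Dropout(0.8))  # 80% dropout
    # ASSUMPTION: sigmoid output, the usual pairing with binary cross-entropy.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-3),
        loss="binary_crossentropy",
    )
    return model


# Early stopping with the reported patience and minimum change.
# ASSUMPTION: the monitored quantity is the validation loss.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=25, min_delta=1e-8
)
```

The callback would be passed to `model.fit(..., callbacks=[early_stop])` together with a validation split or validation data.
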