Monkeypox, a viral disease, has emerged as a significant health concern, and timely diagnosis is often needed to prevent severe outcomes. This project uses machine learning to build a predictive model for rapid and accurate detection of monkeypox. Using a synthetic dataset based on studies published in the British Medical Journal (BMJ), it aims to produce a reliable classification model for identifying positive and negative cases.
- Objective: Build a machine learning model to predict Monkeypox infections with high precision, minimizing false positives.
- Dataset: Synthetic dataset of 25,000 global patients with 10 features and a binary target variable (MonkeyPox).
- Outcome: Logistic Regression achieved the highest precision score of 0.684, outperforming models like K-NN, Decision Trees, and Neural Networks.
The dataset contains 25,000 patient records with the following fields:
- Features: Rectal Pain, Sore Throat, Penile Oedema, Oral Lesions, Solitary Lesion, Swollen Tonsils, HIV Infection, Sexually Transmitted Infection
- Target Variable: MonkeyPox (Binary: 1 = Positive, 0 = Negative)
- Languages: Python
- Libraries:
  - Data Manipulation: Pandas, NumPy
  - Model Training: Scikit-learn
  - Visualization: Matplotlib
- Machine Learning Models:
  - Logistic Regression (Best Model)
  - K-Nearest Neighbors (with and without k-fold)
  - Decision Tree
  - AdaBoost
  - Gradient Descent
  - Neural Networks
- Data Preprocessing:
  - OrdinalEncoder, LabelEncoder
  - StandardScaler
  - GridSearchCV for Hyperparameter Tuning
- Import Libraries: Load essential libraries for data processing and modeling.
- Read Dataset: Explore and preprocess the dataset for training and testing.
- Clean Data: Handle missing values and encode categorical features.
- Split Dataset: Divide data into training and validation sets.
- Train Models: Train multiple models and tune hyperparameters.
- Evaluate Models: Use metrics like precision, recall, and F1 score to assess model performance.
- Select Best Model: Optimize and deploy the best-performing model (Logistic Regression).
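The split, train, tune, and evaluate steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a randomly generated stand-in (eight binary symptom flags and a made-up rule for the target), and the hyperparameter grid, split ratio, and random seeds are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 8 binary symptom flags per patient.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 8))
# Made-up rule so the target depends on the features; NOT the real dataset.
weights = np.array([1.0, 0.5, 0.8, 0.6, 0.4, 0.7, 1.2, 0.9])
logits = X @ weights - 2.5
y = (rng.random(1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Split Dataset: hold out a validation set, stratified on the target.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train Models: scale the features, then fit logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune hyperparameters, using precision as the selection metric.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},  # assumed grid
    scoring="precision",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate Models: precision on the held-out validation set.
val_precision = precision_score(y_val, search.predict(X_val), zero_division=0)
print(f"best C: {search.best_params_['logisticregression__C']}")
print(f"validation precision: {val_precision:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaler is refit on each cross-validation fold, so no statistics from the held-out fold leak into training.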
- Evaluation Metric: Precision (focus on minimizing false positives)
- Best Model: Logistic Regression
- Precision: 0.684 (about 68% of the cases the model predicted as positive were actually positive)
- Key Advantage: Effective in reducing false positives, minimizing unnecessary treatments and associated costs.
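To make the metric concrete: precision is the share of predicted positives that are true positives, TP / (TP + FP). The short sketch below computes it by hand on a small set of hypothetical labels (the label lists are invented for illustration and are not taken from the project's results).

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP) for binary labels where 1 is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # No positive predictions at all; report 0.0 by convention here.
        return 0.0
    return tp / (tp + fp)


# Hypothetical labels: 3 true positives and 2 false positives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]
print(precision(y_true, y_pred))  # 0.6
```

Precision ignores false negatives (the missed case at index 3 above does not change the score), which is why recall and F1 are usually reported alongside it.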
- Early Disease Detection: Enables healthcare professionals to act swiftly, ensuring timely treatment.
- Cost Efficiency: Reduces unnecessary tests and medical expenses caused by false positives.
- Public Health Preparedness: Improves resource allocation and planning for outbreaks.
- Enhanced Outcomes: Contributes to better management and patient care.
- Incorporate real-world datasets to validate the model further.
- Explore ensemble techniques to improve precision and recall.
- Optimize for imbalanced datasets using advanced sampling techniques.
- Deploy the model as a web application or API for practical healthcare use.
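As one possible starting point for the imbalanced-data item, the sketch below shows plain random oversampling of the minority class with scikit-learn's `resample` utility. It is an assumption about how such a step might look, not something the project currently does: the data and the 40/160 class split are invented, and more advanced sampling methods would replace the `resample` call.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced data: 40 positive and 160 negative patients.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))
y = np.array([1] * 40 + [0] * 160)

X_minority, X_majority = X[y == 1], X[y == 0]

# Draw minority rows with replacement until both classes are the same size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate(
    [np.zeros(len(X_majority), dtype=int), np.ones(len(X_minority_up), dtype=int)]
)
print(np.bincount(y_balanced))  # [160 160]
```

Resampling of this kind should be applied to the training split only, so that the validation set keeps the original class balance.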