- 1. Introduction
- 2. The Journey: A Deep Dive into Machine Learning Techniques
- 3. Project Structure
- 4. Methodology
- 5. Unveiling the Results
- 6. Conclusion
- 7. Next Steps
- 8. How to Use this Project
- 9. Author
## 1. Introduction

This repository documents a series of machine learning projects undertaken for Data Money, a fictional data science consultancy. The projects apply a range of machine learning algorithms to diverse business challenges: predicting customer satisfaction, forecasting song popularity, and clustering wine preferences, showing how data-driven solutions can be applied across different industries.
## 2. The Journey: A Deep Dive into Machine Learning Techniques

Our exploration of machine learning is presented through three distinct projects, each addressing a unique business problem and demonstrating the following key aspects:
- Hands-on Experience: We gain practical experience by implementing and evaluating a range of classification, regression, and clustering algorithms.
- Performance Comparison: By rigorously comparing algorithms and analyzing their performance, we gain valuable insights into which methods are best suited for specific tasks.
- Actionable Insights: We interpret model results and extract key insights that can inform strategic decision-making, transforming data into valuable business intelligence.
This repository contains three Jupyter Notebooks, each representing a unique Data Money project:
- `classification.ipynb`: Predicting Customer Satisfaction - We partner with a leading airline to develop a model that predicts customer satisfaction based on travel experiences, enabling proactive service improvements and personalized offerings.
- `regression.ipynb`: Forecasting Song Popularity - We collaborate with a major music streaming platform to build a model that predicts song popularity based on audio features, empowering artists and labels to make strategic decisions about production, marketing, and playlist curation.
- `clustering.ipynb`: Unveiling Wine Preferences - We team up with a renowned wine producer to segment their portfolio and identify key characteristics that drive consumer preferences, guiding product development and marketing strategies.
## 3. Project Structure

The repository is structured as follows:
```
.
├── Data/
│   ├── Classification/
│   ├── Clustering/
│   └── Regression/
├── Notebooks/
│   ├── Classification.ipynb
│   ├── Clustering.ipynb
│   └── Regression.ipynb
├── README.md
├── Results/
└── requirements.txt
```
## 4. Methodology

Our approach to each project follows a structured methodology:
- Data Preparation: We carefully prepare the data, including cleaning, feature engineering, and splitting into training, validation, and test sets.
- Algorithm Selection: We select a range of relevant algorithms based on the nature of the problem and desired outcomes.
- Model Training and Tuning: We first train the selected algorithms with default parameters to establish baselines, then perform systematic hyperparameter optimization to improve performance (a minimal sketch of this workflow follows this list).
- Performance Evaluation: We rigorously evaluate the performance of the models on the training, validation, and test sets using appropriate metrics for each task.
- Insights Extraction: We analyze the model results and derive key insights that can be used to inform business strategies and improve client outcomes.
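As a concrete illustration of this workflow, here is a minimal sketch in Python. The synthetic dataset, the choice of Random Forest as the example model, and the parameter grid are illustrative assumptions, not taken from the notebooks:

```python
# A minimal sketch of the train / tune / evaluate workflow described above.
# Dataset, model, and parameter grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into train (60%), validation (20%), and test (20%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# 1) Baseline model with default parameters.
baseline = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("baseline validation accuracy:", accuracy_score(y_val, baseline.predict(X_val)))

# 2) Systematic hyperparameter search (5-fold CV on the training set).
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print("tuned validation accuracy:", accuracy_score(y_val, grid.predict(X_val)))

# 3) Final evaluation of the tuned model on the held-out test set.
print("tuned test accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```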
## 5. Unveiling the Results

Our experiments produced a rich set of results; the tables below summarize the performance of each algorithm across the three tasks.
### Classification

**On the Training data**
Algorithm | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
KNN | 0.9570 | 0.9576 | 0.9570 | 0.9569 |
Decision Tree | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Random Forest | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Logistic Regression | 0.8753 | 0.8751 | 0.8753 | 0.8749 |
XGBoost | 0.9781 | 0.9782 | 0.9781 | 0.9781 |
**On the Validation data**
Algorithm | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
KNN | 0.9279 | 0.9296 | 0.9279 | 0.9274 |
Decision Tree | 0.9507 | 0.9509 | 0.9507 | 0.9506 |
Random Forest | 0.9446 | 0.9447 | 0.9446 | 0.9444 |
Logistic Regression | 0.8756 | 0.8759 | 0.8310 | 0.8527 |
XGBoost | 0.9620 | 0.9709 | 0.9406 | 0.9555 |
**On the Test data**
Algorithm | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
KNN | 0.9294 | 0.9310 | 0.9294 | 0.9290 |
Decision Tree | 0.9554 | 0.9555 | 0.9554 | 0.9553 |
Random Forest | 0.9623 | 0.9624 | 0.9623 | 0.9622 |
Logistic Regression | 0.8712 | 0.8710 | 0.8712 | 0.8709 |
XGBoost | 0.9626 | 0.9628 | 0.9626 | 0.9625 |
Our classification experiments yielded strong results, with XGBoost emerging as the top performer at 96.3% accuracy on the test data, narrowly ahead of Random Forest at 96.2%. This reflects its ability to accurately predict customer satisfaction from the features analyzed.
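Since each table reports a single precision, recall, and F1 value per model, the notebooks most likely average these metrics across classes; weighted averaging is an assumption here. A minimal sketch of how one row of the tables above could be computed:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def classification_row(y_true, y_pred):
    """One row of the tables above (assumes weighted averaging across classes)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall": recall_score(y_true, y_pred, average="weighted"),
        "F1-Score": f1_score(y_true, y_pred, average="weighted"),
    }

# Toy example; replace with real model predictions.
print(classification_row([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```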
### Regression

**On the Training data**
Algorithm | R² | RMSE | MAE | MAPE | MPE |
---|---|---|---|---|---|
Linear Regression | 0.0461 | 21.3541 | 16.9982 | 8.6532 | -842.08 |
Decision Tree | 0.9918 | 1.9850 | 0.2141 | 0.0826 | -850.96 |
Random Forest | 0.9028 | 6.8158 | 4.8608 | 2.5779 | -856.24 |
Polynomial Regression | 0.0942 | 20.8083 | 16.4580 | 8.3505 | -812.18 |
Lasso Regression | 0.0074 | 21.7824 | 17.3055 | 8.7367 | -850.96 |
Ridge Regression | 0.0461 | 21.3541 | 16.9983 | 8.6534 | -842.10 |
ElasticNet Regression | 0.0078 | 21.7777 | 17.2995 | 8.7323 | -850.96 |
Polynomial Regression Lasso | 0.0092 | 21.7632 | 17.2855 | 8.6997 | -850.96 |
Polynomial Regression Ridge | 0.0932 | 20.8201 | 16.4720 | 8.3727 | -814.50 |
Polynomial Regression ElasticNet | 0.0128 | 21.7228 | 17.2442 | 8.6788 | -850.96 |
XGBoost | 0.7736 | 10.4037 | 7.6817 | 3.4780 | -851.07 |
**On the Validation data**
Algorithm | R² | RMSE | MAE | MAPE | MPE |
---|---|---|---|---|---|
Linear Regression | 0.0399 | 21.4114 | 17.0398 | 8.6825 | -845.13 |
Decision Tree | 0.0636 | 21.1462 | 16.8435 | 8.3958 | -845.36 |
Random Forest | 0.3410 | 17.7386 | 12.9303 | 7.0331 | -858.97 |
Polynomial Regression | 0.0665 | 21.1132 | 16.7499 | 8.5479 | -831.68 |
Lasso Regression | 0.0399 | 21.4114 | 17.0398 | 8.6825 | -845.86 |
Ridge Regression | 0.0399 | 21.4113 | 17.0393 | 8.6823 | -845.11 |
ElasticNet Regression | 0.0399 | 21.4114 | 17.0398 | 8.6825 | -845.86 |
Polynomial Regression Lasso | 0.0669 | 21.1088 | 16.7435 | 8.5618 | -845.88 |
Polynomial Regression Ridge | 0.0676 | 21.1008 | 16.7385 | 8.5603 | -832.97 |
Polynomial Regression ElasticNet | 0.0561 | 21.2299 | 16.8317 | 8.6552 | -846.06 |
XGBoost | 0.0912 | 20.8323 | 16.5531 | 8.4874 | -846.02 |
**On the Test data**
Algorithm | R² | RMSE | MAE | MAPE | MPE |
---|---|---|---|---|---|
Linear Regression | 0.0512 | 21.4939 | 17.1442 | 8.5314 | -830.20 |
Decision Tree | 0.0905 | 21.0440 | 16.8298 | 7.8832 | -847.59 |
Random Forest | 0.4073 | 16.9883 | 12.1913 | 6.3176 | -861.50 |
Polynomial Regression | 0.0884 | 20.8712 | 16.5396 | 8.4228 | -819.23 |
Lasso Regression | 0.0512 | 21.4939 | 17.1442 | 8.5314 | -849.31 |
Ridge Regression | 0.0512 | 21.4939 | 17.1438 | 8.5324 | -830.30 |
ElasticNet Regression | 0.0512 | 21.4939 | 17.1442 | 8.5314 | -849.31 |
Polynomial Regression Lasso | 0.0873 | 20.8844 | 16.5516 | 8.4383 | -848.96 |
Polynomial Regression Ridge | 0.0883 | 20.8728 | 16.5415 | 8.4282 | -819.96 |
Polynomial Regression ElasticNet | 0.0613 | 21.1797 | 16.8022 | 8.5827 | -848.96 |
XGBoost | 0.7118 | 11.7349 | 8.7977 | 4.0081 | -849.05 |
In our quest to predict song popularity, XGBoost again demonstrated its prowess, striking the best balance between bias and variance among the models tested (unlike the Decision Tree, whose training R² of 0.99 collapses to 0.06 on validation). It achieved an R² of 0.712 on the test data, meaning the model explains roughly 71% of the variability in song popularity, a moderate but promising level of predictive ability. Additionally, the RMSE of 11.73 signifies that, on average, the model's predictions deviate from actual song popularity by about 11.73 points. This level of error is reasonable in the context of song popularity, reflecting the model's effectiveness while also highlighting room for improvement.
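For reference, one row of these tables could be computed as below. scikit-learn provides no MPE function, and the notebooks' exact MAPE/MPE conventions are not documented, so the standard percentage definitions here are assumptions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_row(y_true, y_pred):
    """One row of the tables above; MAPE/MPE use assumed percentage definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred),
        # Percentage errors are undefined where y_true == 0.
        "MAPE": np.mean(np.abs(errors / y_true)) * 100,
        # MPE keeps the sign, so it reveals systematic over- or under-prediction.
        "MPE": np.mean(errors / y_true) * 100,
    }

# Toy example; replace with real model predictions.
print(regression_row([50, 60, 70], [48, 63, 69]))
```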
### Clustering

Algorithm | Silhouette Score | Number of Clusters |
---|---|---|
K-Means | 0.2331 | 3 |
K-Means with PCA | 0.3281 | 3 |
Affinity Propagation | 0.2238 | 3 |
Our clustering analysis focused on segmenting wines based on their chemical characteristics, employing K-Means, K-Means with PCA, and Affinity Propagation. Remarkably, all three methods identified three distinct clusters with a high degree of agreement (Adjusted Rand Index between 0.758 and 0.951), suggesting a robust and reliable segmentation of the wines. By leveraging these clusters, wine producers could increase sales and customer satisfaction by offering products precisely aligned with distinct market segments.
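A minimal sketch of this comparison, using scikit-learn's bundled wine dataset as a stand-in for the client's data (an assumption; the actual dataset and preprocessing may differ). Affinity Propagation follows the same pattern via `sklearn.cluster.AffinityPropagation`:

```python
# Compare K-Means with and without PCA on standardized chemical features.
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_wine().data)

# K-Means on the raw standardized features.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("K-Means silhouette:", silhouette_score(X, km_labels))

# K-Means after reducing dimensionality with PCA.
X_pca = PCA(n_components=2, random_state=42).fit_transform(X)
pca_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_pca)
print("K-Means+PCA silhouette:", silhouette_score(X_pca, pca_labels))

# Agreement between the two solutions (cf. the Adjusted Rand Index above).
print("ARI:", adjusted_rand_score(km_labels, pca_labels))
```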
## 6. Conclusion

This repository documents a series of successful machine learning projects, demonstrating Data Money's commitment to leveraging data science to solve real-world problems across diverse industries. Our work on customer satisfaction prediction, song popularity forecasting, and wine preference clustering has yielded valuable insights and laid a foundation for data-driven solutions that empower businesses.
## 7. Next Steps

To further enhance the models and analysis, we are considering the following steps:
- Feature Engineering: Explore and engineer new features based on domain knowledge and data exploration, potentially combining existing features or incorporating external data sources.
- Advanced Algorithms: Experiment with more sophisticated algorithms, such as gradient boosting methods (LightGBM) or neural networks, to potentially improve predictive performance.
- Ensemble Techniques: Investigate different ensemble techniques, beyond Random Forest and XGBoost, to combine the strengths of multiple models and further enhance accuracy and robustness.
- Data Collection and Augmentation: Gather additional data, especially for the regression task, to capture more factors that influence song popularity, such as social media engagement, marketing data, and expert reviews.
- Deployment and Monitoring: Deploy the models into a production environment, potentially creating a user interface or integrating them into existing systems for real-time predictions. Implement monitoring mechanisms to track model performance over time and retrain the models as needed to maintain accuracy.
## 8. How to Use this Project

This section guides you through setting up the environment and exploring the project notebooks.
**Prerequisites:**
- This project was developed with Python 3.12.4.
- You need to have a Jupyter Notebook environment installed.
**Libraries:**

The project relies on standard Python data science libraries, such as scikit-learn and XGBoost; the full list is in `requirements.txt` and is installed in the instructions below.
**Instructions:**

- Clone this repository to your local machine:

```bash
git clone https://github.com/Daniel-ASG/Intro_ML_CDS.git
```

- Install the required libraries:

```bash
pip install -r requirements.txt
```

- Open the Jupyter Notebooks and run the cells.
## 9. Author

Made by Daniel Gomes. Feel free to reach out!