This machine learning pipeline automates the complete data science workflow from raw data to production-ready models.
• Data Fetching from Kaggle
- Retrieves datasets from Kaggle platform for training purposes
Seamlessly connects to Kaggle API to download and access high-quality datasets for machine learning projects.
• Data Ingestion
- Loads and processes raw data into the pipeline infrastructure
Efficiently handles multiple data formats and sources with robust error handling and validation mechanisms.
• Data Validation
- Ensures data quality, completeness, and schema compliance
Performs comprehensive data integrity checks and identifies missing values, outliers, and data inconsistencies.
• Data Transformation
- Performs feature engineering, preprocessing, and statistical tests
Applies scaling, encoding, feature selection, and statistical analysis to prepare data for optimal model performance.
• Model Training
- Trains machine learning models using processed data
Implements various algorithms with hyperparameter tuning and cross-validation for robust model development.
• Model Evaluation
- Assesses model performance and validates results
Generates detailed performance metrics, confusion matrices, and validation reports to ensure model reliability.
• MLflow Tracking
- Monitors experiments, parameters, and model versions
Tracks all experiment runs, logs parameters, metrics, and artifacts for complete experiment management and reproducibility; a minimal logging sketch follows this feature list.
• DVC Integration
- Manages data versioning and pipeline reproducibility
Maintains version control for datasets and models, enabling collaborative development and pipeline reproducibility.
• Apache Airflow Orchestration
- Automates and schedules the entire pipeline workflow
Provides robust workflow scheduling, dependency management, and monitoring with automated retry mechanisms and alerting.
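The training, evaluation, and tracking stages described above come together in a few lines of MLflow code. The following is a minimal sketch, assuming a scikit-learn classifier and a transformed CSV under artifacts/; the file name, target column, and hyperparameters are illustrative only and are not taken from the project:

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative artifact path and target column; the real pipeline defines its own
df = pd.read_csv("artifacts/data_transformation/train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.2, random_state=42
)

params = {"n_estimators": 200, "max_depth": 8}  # example hyperparameters

with mlflow.start_run(run_name="rf_baseline"):
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")  # store the model as a versioned artifact

Every run logged this way appears as a separate entry in the MLflow UI, which is what makes later comparison and rollback of model versions possible.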
Execute the complete machine learning pipeline with a single command:
python main.py
This command initiates the entire workflow, processing data through all stages from ingestion to model evaluation.
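Internally, main.py typically just runs the stage pipelines in sequence. The outline below is a rough sketch under the assumption that each stage exposes a pipeline class with a main() method; the module and class names are illustrative, not taken from the repository:

import logging

# Hypothetical stage pipeline classes; the real project defines its own modules
from src.pipeline.data_ingestion import DataIngestionPipeline
from src.pipeline.data_validation import DataValidationPipeline
from src.pipeline.data_transformation import DataTransformationPipeline
from src.pipeline.model_trainer import ModelTrainerPipeline
from src.pipeline.model_evaluation import ModelEvaluationPipeline

logging.basicConfig(level=logging.INFO)

STAGES = [
    ("Data Ingestion", DataIngestionPipeline),
    ("Data Validation", DataValidationPipeline),
    ("Data Transformation", DataTransformationPipeline),
    ("Model Training", ModelTrainerPipeline),
    ("Model Evaluation", ModelEvaluationPipeline),
]

if __name__ == "__main__":
    for stage_name, pipeline_cls in STAGES:
        logging.info(">>> Stage started: %s", stage_name)
        pipeline_cls().main()  # each stage writes its outputs under artifacts/
        logging.info(">>> Stage completed: %s", stage_name)

Running every stage through a single entry point like this is also what lets Airflow later trigger the whole workflow from one task.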
Navigate to the artifacts folder to examine the output of each pipeline stage:
cd artifacts
Stage-wise Outputs:
data_ingestion/ - Contains the ingested training dataset
data_validation/ - Validation status and reports
data_transformation/ - Transformed datasets and statistical test results
model_trainer/ - Trained model files and parameters
model_evaluation/ - Performance metrics and evaluation reports
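As one example of inspecting these outputs programmatically, the evaluation report can be loaded and printed. The snippet below assumes the metrics are stored as JSON under model_evaluation/; the file name metrics.json is an assumption and may differ in the repository:

import json
from pathlib import Path

# Assumed location of the evaluation report; adjust to the actual file name
metrics_path = Path("artifacts/model_evaluation/metrics.json")

with metrics_path.open() as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")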
Track all experiments, parameters, and model versions through DagsHub MLflow Integration:
View Model Tracking Dashboard
Monitor experiment runs, compare model performance, and access detailed logging information for complete pipeline transparency.
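Pointing local runs at the DagsHub-hosted tracking server only requires the tracking URI and credentials. The snippet below is a sketch with placeholder values; the username, repository name, and token are illustrative, and DagsHub shows the exact URI on the repository page:

import os
import mlflow

# Placeholder DagsHub coordinates; replace with the real repository and access token
os.environ["MLFLOW_TRACKING_USERNAME"] = "<dagshub-username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<dagshub-token>"
mlflow.set_tracking_uri("https://dagshub.com/<dagshub-username>/<repo-name>.mlflow")

with mlflow.start_run(run_name="remote_tracking_check"):
    mlflow.log_param("tracking_backend", "dagshub")
    mlflow.log_metric("dummy_metric", 1.0)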
Note: Apache Airflow requires a Linux-based system; Windows users should use WSL or a native Linux environment.
# Copy the project from the Windows drive into the WSL home directory
cp -r /mnt/d/MachineLearning-PipeLine ~/
# Install Python3 if not already installed
sudo apt update
sudo apt install python3 python3-pip python3-venv
# Navigate to project directory
cd ~/MachineLearning-PipeLine
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Set Airflow home directory
export AIRFLOW_HOME=~/airflow
echo $AIRFLOW_HOME
# Configure Airflow settings
vim ~/airflow/airflow.cfg
In vim, press i to enter insert mode and replace the line
"auth_manager = airflow.api_fastapi.auth.managers.simple.simple_auth_manager.SimpleAuthManager"
with
"auth_manager = airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager"
Then press Esc and type :wq! to save and exit.
# Create DAGs directory
mkdir -p ~/airflow/dags
# Copy DAG file to Airflow directory
cp model_dag.py ~/airflow/dags/
# Test DAG configuration
python ~/airflow/dags/model_dag.py
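For reference, a minimal version of what model_dag.py could look like is sketched below using Airflow's TaskFlow API. It simply shells out to the pipeline entry point; the dag_id matches the one referenced later in this guide, while the schedule, start date, and project path are assumptions:

import os
import subprocess
from datetime import datetime

from airflow.decorators import dag, task

PROJECT_DIR = os.path.expanduser("~/MachineLearning-PipeLine")  # path used earlier in this guide


@dag(
    dag_id="ml_pipeline_dag",
    schedule=None,  # trigger manually from the Airflow UI
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["mlops"],
)
def ml_pipeline():
    @task
    def run_pipeline():
        # Run the full pipeline exactly as it is run from the command line
        subprocess.run(["python", "main.py"], cwd=PROJECT_DIR, check=True)

    run_pipeline()


ml_pipeline()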
# Start Airflow standalone mode
airflow standalone
- Open your web browser and navigate to: http://localhost:8080
- Search for ml_pipeline_dag in the DAGs list
- Click on the DAG and trigger the pipeline execution
- Monitor the workflow progress through the Airflow UI
The pipeline triggers automatically based on code changes and performs the following actions:
- Build - Creates Docker images from the application code
- Scan - Performs security and vulnerability scans on the built images
- Deliver - Pushes the validated images to DockerHub registry
# Initialize Minikube cluster
minikube start
# Deploy the application using deployment manifest
kubectl apply -f deployment.yaml
# Check pod status
kubectl get pods
# Apply service configuration
kubectl apply -f service.yaml
# Verify service deployment
kubectl get svc
# Forward service port to local machine
kubectl port-forward svc/mlapp-service 8000:80
# Access application in browser
# Navigate to: http://localhost:8000
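With the port-forward running, the service can also be checked from a script instead of the browser. Below is a quick sketch using the requests library; the exact routes the app exposes, such as a prediction endpoint, depend on the application code:

import requests

# With `kubectl port-forward svc/mlapp-service 8000:80` active,
# the application answers on localhost:8000
response = requests.get("http://localhost:8000", timeout=10)
print("status:", response.status_code)
print(response.text[:500])  # first part of the returned page or JSON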
# Edit service configuration
kubectl edit svc mlapp-service
# Change the service type from NodePort to LoadBalancer,
# then press Esc and type :wq! to save and exit the editor
# Open new terminal and create tunnel
minikube tunnel
# In original terminal, check for external IP
kubectl get svc
# Access the application in the browser using the EXTERNAL-IP shown, e.g. http://127.0.0.1
# Deploy the ingress.yaml file
kubectl apply -f ingress.yaml
# Install the Ingress Controller (nginx)
minikube addons enable ingress
# Check that the nginx ingress controller pods are running
kubectl get pods -A | grep nginx
# Check that the Ingress is deployed
kubectl get ingress   # the ADDRESS column will show an IP such as 192.168.49.2
# Set up local hostname resolution
sudo vim /etc/hosts
# Add the Ingress address entry so the file contains:
127.0.0.1 localhost
127.0.1.1 Abis-PC
192.168.49.2 foo.bar.com
# Press Esc and type :wq! to save and exit
# Check that the hostname resolves
ping foo.bar.com
# Then open the application routes in the browser:
http://foo.bar.com/demo
http://foo.bar.com/admin
This comprehensive machine learning pipeline demonstrates a complete end-to-end MLOps implementation that transforms raw data into production-ready predictive models. The project successfully integrates modern data science practices with robust automation and deployment capabilities.
Key Achievements:
• Automated Workflow - Seamlessly orchestrates data ingestion, validation, transformation, model training, and evaluation through Apache Airflow, reducing manual intervention and ensuring consistent pipeline execution.
• Comprehensive Tracking - Implements MLflow and DVC integration for complete experiment tracking, model versioning, and data lineage management, enabling reproducible research and collaborative development.
• Production-Ready Deployment - Provides containerized deployment through Docker and Kubernetes with CI/CD pipeline integration, ensuring scalable and maintainable model serving capabilities.
• Quality Assurance - Incorporates rigorous data validation, statistical testing, and model evaluation frameworks that guarantee reliable and trustworthy machine learning outcomes.
• Modern MLOps Practices - Leverages industry-standard tools and methodologies including automated testing, version control, monitoring, and deployment strategies for enterprise-grade machine learning solutions.
This pipeline serves as a robust foundation for data science teams looking to implement scalable, maintainable, and production-ready machine learning workflows.