Bike Sharing Demand Prediction

Project Description

This is the implementation of my project for the course mlops-zoomcamp from DataTalksClub. The goal of this project is to build an end-to-end machine learning pipeline to predict bike-sharing demand using historical data. This prediction will help optimize bike distribution and availability in a bike-sharing system. The main focus of the project is on creating a production service with experiment tracking, pipeline automation, and observability.

Problem Statement

Bike-sharing systems are becoming increasingly popular in urban areas as a convenient and eco-friendly mode of transportation. However, managing the distribution of bikes to meet demand is challenging. The objective of this project is to predict the number of bikes required at different stations at different times of the day to ensure optimal availability and customer satisfaction. By addressing these challenges through data analysis, the project aims to enhance the overall user experience, increase operational efficiency, and promote sustainable urban transportation.

Dataset

The dataset used for this project is the "Bike Sharing Demand" dataset, which includes historical data on bike rentals, weather conditions, and timestamps. It is available from the UCI Machine Learning Repository.
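
As a quick orientation, the hourly data can be loaded with pandas. This is a minimal sketch; the path data/hour.csv is an assumption based on the standard UCI release and this repository's data folder:

    import pandas as pd

    # Path is an assumption based on the standard UCI release and the data/ folder.
    df = pd.read_csv("data/hour.csv", parse_dates=["dteday"])

    # Typical columns include season, hr, temp, hum, windspeed, and the target cnt.
    print(df[["dteday", "hr", "temp", "hum", "windspeed", "cnt"]].head())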

Project details

This repository has six main folders: data, notebooks, models, src, web_service, and monitoring.

  • The data folder contains the dataset for the project.
  • The notebooks folder contains Jupyter notebooks used for exploratory data analysis (EDA), and initial model experimentation.
  • The models folder stores the trained machine learning models and any related artifacts.
  • The src folder contains the source code for the project.
  • The web_service folder contains the source code for the model deployment.
  • The monitoring folder contains the model monitoring setup built with Evidently AI (see Model Monitoring below).

Additional files

  • requirements.txt
    • Lists all the Python dependencies required for the project.
  • Dockerfile
    • Defines the Docker image for the project, specifying the environment and dependencies required to run the code.
  • deployment.yaml
    • Defines the Kubernetes deployment and service for the project.

Clouds

  • The project is containerized with Docker and deployed to Kubernetes.

Quick Start: Run the whole thing in 5 minutes

To get started with this project, follow these steps in your terminal:

  1. Clone the Repository:

    Begin by cloning the project repository from GitHub:

    git clone https://github.com/kachiann/project-mlops.git
  2. Navigate to the Project Directory:

    Change your directory to the newly cloned project folder:

    cd project-mlops
  3. Set Up the Environment:

    Ensure you have a Python environment set up. You can create a virtual environment using:

    python3.11 -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`

    Upgrade pip: Once the virtual environment is activated, upgrade pip to the latest version:

    python -m pip install --upgrade pip
  4. Install Dependencies:

    Install the necessary packages:

    make setup
  5. Start the Model Training with MLflow:

    make mlflow

    Open a new terminal tab and ensure the virtual environment is still activated:

    source venv/bin/activate

    This ensures that all dependencies are correctly referenced.

    Then run:

    make train
  6. Make Prediction:

    make deploy

    Open a new terminal tab and ensure the virtual environment is still activated:

    source venv/bin/activate

    This ensures that all dependencies are correctly referenced.

    Then run:

    make predict

To run the implementation files, ensure the virtual environment is activated so that all dependencies are correctly referenced.

Implementation Details

1. Experiment Tracking and Model Registry:

  • Experiment Tracking with MLflow:

    Use MLflow to track experiments, metrics, and artifacts. Start the MLflow server with a remote backend:

    mlflow server --backend-store-uri sqlite:///backend.db

    Or start the MLflow server with a remote backend, specifying the default artifact root:

    mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./mlruns

    Once the server is running, you can access the MLflow UI by navigating to the following URL in your web browser:

    http://localhost:5000

    This web interface allows you to visualize your experiments, compare model metrics, and manage your model registry.

    Note: Ensure that you are operating within the activated virtual environment (venv) throughout this process.

  • Model Development:

    Develop Linear Regression and Decision Tree Regressor models using scikit-learn on the bike-sharing dataset. Use the following command to execute the experiment tracking script:

    python src/experiment_tracking.py
  • Model Comparison and Registration:

    Compare model performance by examining metrics such as MAE and R2. Register the models in the MLflow Model Registry. To manage the model registry, use:

    python src/model_registry.py

    When prompted, enter the run IDs for the Linear Regression and Decision Tree Regressor runs generated by experiment_tracking.py.

  • Model Loading:

    Implement functionality to load models from the saved pickle files for further use or deployment (a combined sketch of these steps follows this list).
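
The exact logic lives in src/experiment_tracking.py and src/model_registry.py; the following is only a minimal sketch of the same workflow, with the experiment name, registered model name, feature list, and file paths chosen for illustration:

    import pickle

    import mlflow
    import mlflow.sklearn
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("bike-sharing-demand")  # illustrative experiment name

    df = pd.read_csv("data/hour.csv")
    features = ["season", "hr", "holiday", "workingday", "weathersit",
                "temp", "hum", "windspeed"]  # illustrative feature subset
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["cnt"], random_state=42)

    # Track one run per model, logging MAE and R2 for comparison.
    for name, model in [("linear_regression", LinearRegression()),
                        ("decision_tree", DecisionTreeRegressor(max_depth=10))]:
        with mlflow.start_run(run_name=name) as run:
            model.fit(X_train, y_train)
            preds = model.predict(X_test)
            mlflow.log_metric("mae", mean_absolute_error(y_test, preds))
            mlflow.log_metric("r2", r2_score(y_test, preds))
            mlflow.sklearn.log_model(model, artifact_path="model")
            print(name, run.info.run_id)  # run IDs are needed for registration

    # Register the last run's model in the Model Registry (name is illustrative).
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "bike-sharing-model")

    # Loading a model back from a saved pickle file (path is illustrative).
    with open("models/decision_tree.pkl", "rb") as f:
        loaded_model = pickle.load(f)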


2. Workflow Orchestration (Fully deployed):

Prefect is used to create and manage the entire ML pipeline. It is a powerful, open-source workflow orchestration tool that lets users design, monitor, and respond to data and machine learning pipelines using Python code. The ml_pipeline.py script includes data ingestion, preprocessing, model training, evaluation, and deployment steps.

Prefect Task:

A Prefect Task is a Python function decorated with @task that represents a discrete unit of work within a Prefect workflow. ml_pipeline.py is a machine learning pipeline that integrates MLflow for experiment tracking with Prefect for workflow management.

Task Customization: You can customize the task decorator with optional arguments like name, description, tags, cache settings, retries, and more.

Prefect Flow

The ml_pipeline() function, decorated with @flow, orchestrates the entire workflow (a minimal sketch follows the list below):

  • Sets up MLflow tracking URI and experiment name
  • Executes each task in sequence
  • Passes data between tasks
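
As a minimal, hedged illustration of this structure (the task bodies and names below are chosen for the example, not taken verbatim from ml_pipeline.py):

    import mlflow
    import pandas as pd
    from prefect import flow, task

    @task(retries=3, retry_delay_seconds=10)
    def ingest_data(path: str) -> pd.DataFrame:
        # Task-level retries are useful for flaky I/O such as downloads.
        return pd.read_csv(path)

    @task
    def train_model(df: pd.DataFrame) -> None:
        with mlflow.start_run():
            ...  # fit and log the model, as in experiment_tracking.py

    @flow(name="ml-pipeline")
    def ml_pipeline(path: str = "data/hour.csv") -> None:
        # The flow sets up MLflow, then runs the tasks in sequence,
        # passing data between them.
        mlflow.set_tracking_uri("http://localhost:5000")
        mlflow.set_experiment("bike-sharing-demand")
        df = ingest_data(path)
        train_model(df)

    if __name__ == "__main__":
        ml_pipeline()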

Prefect Deployments

Deployments are flows stored on the local server or on the cloud and include important information for orchestrating your workflow remotely, such as scheduling and execution details. Using Prefect offers several advantages:

  • Monitoring the pipeline
  • Scheduling with CRON
  • Automatic retries on failure
  • Logging and observability
  • Notifications
  • Automated workflows requiring no human intervention

Usage Instructions

Note: Ensure that you are operating within the activated virtual environment (venv) throughout this process. If you haven’t activated your virtual environment yet, do so with the following command:

source venv/bin/activate  # On Windows use `venv\Scripts\activate`

1. Log in to Prefect Cloud

Before running the ml_pipeline.py file, log in to Prefect Cloud using the following command:

prefect cloud login

Follow the prompts to enter your Prefect Cloud API key and complete the login process.

2. Install Dependencies

Ensure all necessary dependencies are installed by running:

pip install -r requirements.txt

3. Test the Prefect Flow

Run the Python file in the terminal to test the Prefect Flow:

python src/ml_pipeline.py

4. Build and Deploy the Prefect Deployment Locally

To run the deployment locally, build the "Deployment" by providing the file and flow function names:

prefect deployment build src/ml_pipeline.py:ml_pipeline -n 'ml_pipeline_bike_sharing_analysis' -a --tag dev

5. Start Prefect Agents

Initialize the Prefect agents in a new terminal with the default work pool name:

prefect agent start -p 'default-agent-pool'

6. Run the Deployment

Open a new terminal tab and ensure the virtual environment is still activated:

source venv/bin/activate

Then execute the deployment:

prefect deployment run 'ml-pipeline/ml_pipeline_bike_sharing_analysis'

7. Use the Prefect UI

To run and monitor your workflows using the Prefect UI:

  1. Start the Prefect server:

    prefect server start
  2. Access the Prefect Dashboard by navigating to the following URL in your web browser:

    http://localhost:4200

  3. In the Deployment section of the Prefect UI, you can view the current Deployment along with its activity and tags.


3. Model Deployment:

To deploy the trained machine learning models as a web service, we utilize Flask along with MLflow. The deployment script, deploy.py, is responsible for loading the model and serving predictions through a RESTful API.

For detailed usage instructions, please refer to the web_service folder in this repository.
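
The actual service code is in web_service/deploy.py; the sketch below only illustrates the general shape of such a service. The registered model name and the /predict route are assumptions for this example:

    import mlflow.pyfunc
    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Load the latest Production version of the registered model
    # (the model name is an assumption for this sketch).
    mlflow.set_tracking_uri("http://localhost:5000")
    model = mlflow.pyfunc.load_model("models:/bike-sharing-model/Production")

    @app.route("/")
    def index():
        return "Welcome to the ML Prediction API!"

    @app.route("/predict", methods=["POST"])
    def predict():
        # Accept a single JSON record or a list of records with the feature columns.
        payload = request.get_json()
        records = payload if isinstance(payload, list) else [payload]
        preds = model.predict(pd.DataFrame(records))
        return jsonify({"predictions": preds.tolist()})

    if __name__ == "__main__":
        app.run(host="127.0.0.1", port=8080)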

Key Steps for Model Deployment:

  1. Launch the MLflow Server: Before running the deployment script, start the MLflow server locally by executing the following command:

    mlflow server --backend-store-uri sqlite:///backend.db

  2. Run the Deployment Script: Execute the deploy.py file to start the Flask API:

    python web_service/deploy.py

By default, Flask will start the app at http://127.0.0.1:8080.

Open a web browser or use an HTTP client to access http://127.0.0.1:8080/. You should see the message: "Welcome to the ML Prediction API!".

  3. Model Loading and Prediction Logic: The script automatically loads the latest production version of the registered model from MLflow for making predictions based on incoming requests.

  4. Requirements: The requirements.txt file lists the necessary dependencies for running the deployment script.

  5. Containerization: To deploy the application in a Docker container, a Dockerfile is provided.

  6. Testing the API: Instructions for testing the deployed API are available in the web_service folder. You can test the API endpoint using curl, Postman, or any other HTTP client (see the example after this list).

  7. Kubernetes Deployment: If interested in deploying the Flask application on Kubernetes, a deployment.yaml file is provided in the web_service folder.
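
As a quick illustration of testing the endpoint (the feature names follow the UCI hourly schema, and the /predict route is an assumption carried over from the sketch above):

    import requests

    # Illustrative feature values following the UCI hourly schema.
    sample = {"season": 3, "hr": 8, "holiday": 0, "workingday": 1,
              "weathersit": 1, "temp": 0.62, "hum": 0.48, "windspeed": 0.19}

    resp = requests.post("http://127.0.0.1:8080/predict", json=sample)
    print(resp.json())  # e.g. {"predictions": [...]}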

By following the instructions in the web_service folder, you can successfully deploy the bike-sharing demand prediction model as a web service using Flask and MLflow, allowing users to make predictions via API calls.


4. Model Monitoring:

Evidently AI is an open-source Python library for monitoring ML models during development, validation, and in production. It checks data and model quality, data drift, target drift, and regression and classification performance. We will integrate Evidently AI into the project to enhance our data and model monitoring capabilities.
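
As a minimal, hedged sketch of how such a drift report can be produced with Evidently's Report API (the file path and the reference/current split are illustrative; the project's actual checks live in the monitoring folder):

    import pandas as pd
    from evidently.metric_preset import DataDriftPreset
    from evidently.report import Report

    df = pd.read_csv("data/hour.csv")  # illustrative path

    # Treat the earlier rows as the reference window and the rest as "current".
    reference, current = df.iloc[:8000], df.iloc[8000:]

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html("drift_report.html")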

Usage:

  1. Navigate to the Monitoring Folder:

    Change your directory to the monitoring folder:

    cd monitoring
  2. Run Docker Compose:

    Start the Evidently monitoring environment using Docker:

    docker compose up --build
  3. Start the MLflow UI:

    In a new terminal window, load the MLflow UI to track your experiments:

    mlflow server --backend-store-uri sqlite:///backend.db
  4. View Baseline Model Metrics:

    To view the baseline model metrics via the Evidently UI, you can run the Jupyter notebook baseline_model_evidently.ipynb located in the monitoring folder. Make sure to open the relevant notebook and execute the cells to generate the necessary visualizations.

  5. Run the Evidently UI:

    Execute the following command in the terminal to start the Evidently UI:

    evidently ui

If running on localhost, visit: http://localhost:8000 and not http://0.0.0.0:8000.

This setup allows you to monitor your machine learning models effectively, providing insights into data quality, model performance, and any potential drifts in your data. By integrating Evidently AI, you can ensure that your models remain robust and reliable in production.

5. Best Practices:

    To-Do List

  • Reproducibility

    The versions for all dependencies are specified.

  • Unit Tests

  • Integration test

    See the tests folder. You can also run:

      make test
  • Linter and/or formatter

    Added docstrings, improved variable naming, and addressed Pylint issues.

  • Makefile

    You can run:

        make lint

  • pre-commit hooks

  • CI/CD pipeline

    See .github folder.

    To view the CI/CD pipelines, go to the develop branch and locate the .github/workflows directory. Inside, you'll find two files: ci-pipeline.yml for CI and cd-deploy.yml for CD.

6. Services

  • MLflow UI: http://localhost:5000
  • Prefect UI: http://localhost:4200
  • Flask prediction API: http://127.0.0.1:8080
  • Evidently UI: http://localhost:8000
