arv-anshul/yt-comment-sentiment

YouTube Comment Sentiment

An end-to-end project to predict the sentiment of YouTube video comments using Machine Learning.

Overview

This project builds a sentiment analysis system for YouTube comments, with a FastAPI-based inference endpoint and additional API endpoints that provide insights. Development included experimentation, experiment tracking with MLflow, and reproducible pipelines with DVC.

diagram

Key Features

  • Inference Endpoint: Built with the FastAPI framework to classify the sentiment of comments.
  • Insights Endpoints: Additional APIs that provide analytics on comment sentiments.
  • Experiment Tracking: Uses MLflow to track experiments.
  • Pipeline Reproduction: Uses DVC (Data Version Control) for reproducibility.
  • Text Vectorization: Used TfidfVectorizer for transforming text data into feature vectors.
  • Model Selection: Experimented with various models and selected HistGradientBoostingClassifier as the best-performing classifier.

Experimentation

The experimentation phase focused on optimizing the hyperparameters of the TfidfVectorizer and the HistGradientBoostingClassifier. The screenshot below shows how different hyperparameter combinations affected accuracy:

Experiment Results

Tech Stack

Data Handling: Polars
Backend Tools: MLflow, DVC, FastAPI
Machine Learning: scikit-learn, NLTK
Frontend: pnpm, shadcn/ui, Tailwind CSS, Vite, Vue.js
Dev Tools: uv, pre-commit, Ruff, Zed, Loguru

⚠️ Improvements

  1. Merge the classifier and the vectorizer into a single model, which reduces the complexity of loading them using MLFLOW_RUN_ID in app.py.
  2. After completing the previous step, load the model using the MLFLOW_MODEL_URI environment variable instead of MLFLOW_RUN_ID.
  3. ⚠️ Try using an MLproject file to run the ML pipeline steps instead of the dvc.yaml file (only if possible).
    • Also investigate the use of DVC here to understand the WHY, WHAT and HOW (part of it).
  4. Clarify the distinction and the interaction between the source code of the ML pipeline and the backend.
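
Steps 1 and 2 above could be approached along these lines. This is a sketch under assumptions, not the current contents of app.py: the helper names, the `"model"` artifact path, and the fallback to MLFLOW_RUN_ID during migration are all hypothetical. It relies on MLflow's `runs:/<run_id>/<artifact_path>` URI scheme and on `mlflow.sklearn.load_model`, and assumes the merged vectorizer and classifier were logged as one scikit-learn model.

```python
import os
from typing import Mapping, Optional

# Assumption: the artifact path under which the merged model was logged.
DEFAULT_ARTIFACT_PATH = "model"


def resolve_model_uri(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer MLFLOW_MODEL_URI; fall back to a run-based URI during migration."""
    env = os.environ if env is None else env

    model_uri = env.get("MLFLOW_MODEL_URI")
    if model_uri:
        return model_uri

    run_id = env.get("MLFLOW_RUN_ID")
    if run_id:
        return f"runs:/{run_id}/{DEFAULT_ARTIFACT_PATH}"

    raise RuntimeError("Set MLFLOW_MODEL_URI (or MLFLOW_RUN_ID) before starting the app.")


def load_sentiment_model(env: Optional[Mapping[str, str]] = None):
    """Load the single merged model (vectorizer + classifier) from MLflow."""
    # Imported lazily so resolve_model_uri can be used without MLflow installed.
    import mlflow.sklearn

    return mlflow.sklearn.load_model(resolve_model_uri(env))


print(resolve_model_uri({"MLFLOW_RUN_ID": "abc123"}))
```

With a single model URI, the backend no longer needs to know which run produced the vectorizer and which produced the classifier.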

Sentiment Model

  • We can handle the class imbalance in the training dataset, which might improve the model metrics.
  • We can also find more diverse data, because I saw during EDA that the current dataset contains many political comments.
  • We can try different text vectorization steps, such as vectorization + PCA.
  • We can fine-tune a BERT model and use it instead.

Important

Feel free to explore and contribute!