Valli Meenaa Vellaiyan
Niresh Subramanian
Venkata Subbarao Shirish Addaganti
Harshit Sampgaon
Prabhat Chanda
Praneeth Korukonda
In today's competitive e-commerce landscape, understanding customer feedback is critical for improving product offerings and enhancing the overall customer experience. This project focuses on leveraging sentiment analysis of Amazon reviews to gain deeper insights into customer sentiment. By categorizing reviews into positive, neutral, or negative, businesses like Amazon can extract actionable insights to inform product decisions, optimize customer service, and drive strategic initiatives aimed at improving customer satisfaction.
The goal of this project is to automate the end-to-end process of analyzing review data, from ingestion and pre-processing to model training, deployment, and monitoring. This automated solution not only increases operational efficiency but also helps Amazon better understand how customers feel about its products and services. By analyzing sentiment trends over time, the model provides valuable insights into key pain points, emerging trends, and areas for improvement across different product categories.
Using advanced sentiment analysis, this project enables Amazon to identify critical feedback faster, enhance the customer experience, and make data-driven decisions that align with business goals such as increasing customer retention, boosting sales, and reducing return rates. This ultimately contributes to a more responsive and customer-focused business strategy, directly impacting business growth.
The UCSD Amazon Reviews 2023 dataset is a large, publicly available collection of customer reviews across multiple product categories on Amazon. It contains approximately 338 million reviews, spanning diverse customer experiences and sentiments.
The dataset provides a foundation for sentiment analysis and is integral to our project's goal of classifying reviews into positive, neutral, or negative categories. The large volume and diversity of the dataset make it ideal for building scalable machine learning models.
- Dataset Source: UCSD Amazon Reviews 2023
- Citation: Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, Julian McAuley. Bridging Language and Items for Retrieval and Recommendation. arXiv:2403.03952, 2024.
- Format: CSV/JSON
- Size: 338 million reviews
- Data Types: String, Numeric, List, Boolean, Dictionary, Timestamps
- Key Features:
  - Review Text: The main content of customer feedback
  - Star Rating: Ratings from 1 to 5 stars
  - Product Category: Product category for the reviewed item
  - Review Timestamp: Date and time of the review
  - Product Metadata: Additional product-related details
  - Verified Purchase: Indicator of whether the review is from a verified purchase
  - Review Helpfulness: Upvotes or downvotes received by the review (if available)
- The dataset is available for non-commercial use, and user identifiers like reviewer IDs are excluded to prevent privacy breaches. The project adheres to data minimization principles and complies with relevant privacy regulations.
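To give a feel for the schema described above, the sketch below loads a slice of raw reviews into pandas and derives a positive/neutral/negative label from the star rating (a common convention: 1-2 stars negative, 3 neutral, 4-5 positive). The file path and field names follow the dataset's JSONL layout but are assumptions here, and this labeling rule is for illustration only, not necessarily the exact rule used in the pipeline.

```python
# explore_reviews.py -- illustrative sketch; the file path and labeling rule are assumptions.
import pandas as pd

# A JSONL file of raw reviews for one category (path is a placeholder).
reviews = pd.read_json("data/raw/Electronics.jsonl", lines=True, nrows=10_000)

def rating_to_sentiment(rating: float) -> str:
    """Map a 1-5 star rating to a coarse sentiment label."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"

reviews["sentiment"] = reviews["rating"].apply(rating_to_sentiment)
print(reviews["sentiment"].value_counts())
```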
.
├── .dvc # DVC configuration files for data versioning
├── .github/workflows # GitHub Actions workflows for CI/CD
├── data_pipeline # Main data pipeline directory
├── milestones # Project milestone documents
├── model_pipeline # Machine learning pipelines
│ ├── rag # Retrieval-Augmented Generation pipeline
│ │ ├── config # RAG configuration files
│ │ ├── dags # DAGs for RAG
│ │ │ └── utils
│ │ └── data # Data for RAG pipeline
│ └── Streamlit # Streamlit-based interactive dashboards
│ └── items # Dashboard components
└── project_pipeline/ # Final project pipeline to run data and model pipelines
├── config
├── dags # Directed Acyclic Graph (DAG) files for Airflow workflows.
│ ├── batch_processing_dag.py # DAG for batch processing tasks.
│ ├── data_acquisition_dag.py # DAG for acquiring data.
│ ├── data_preprocessing_dag.py # DAG for data preprocessing tasks.
│ ├── data_utils # Directory containing utility scripts for data tasks.
│ │ ├── config.py # Configuration file for data utilities.
│ │ ├── data_collection # Data collection related scripts
│ │ ├── data_preprocessing # Data preprocessing related scripts
│ │ ├── data_validation # Data validation related scripts
│ │ ├── __init__.py # Initialization file for the data_utils package.
│ ├── data_validation_dag.py # DAG for data validation tasks.
│ ├── gcs_pull_dag.py # DAG for pulling data from Google Cloud Storage (GCS).
│ ├── __init__.py # Initialization file for the DAG package.
│ ├── model_utils # Directory for model utility scripts.
│ │ ├── dsl_components.py # Kubeflow components for the pipeline.
│ │ ├── dsl_pipeline.py # Pipeline definition using the DSL components.
│ │ ├── model_config.py # Configuration file for models.
│ │ ├── pipeline_CI_CD.py # CI/CD pipeline for model deployment and monitoring.
│ │ └── src # Source code for model utility components
│ ├── sampling_serve_dag.py # DAG for serving sampled data.
│ ├── sampling_train_dag.py # DAG for training models with sampled data.
│ ├── serve_utils # Directory for serving-related utilities.
│ │ ├── batch_processing.py # Script for batch processing during model serving.
│ └── vertex_ai_submit_dag.py # DAG for submitting jobs to Vertex AI.
├── data # Directory for storing datasets.
│ ├── labeled # Labeled data for training and serving.
│ │ ├── serve # Data for model serving predictions.
│ │ └── train # Data for model training.
│ ├── raw # Raw data before any processing or cleaning.
│ └── sampled # Sampled data used for training or serving.
│ ├── serving # Sampled data for serving predictions.
│ └── training # Sampled data for training models.
├── docker-compose.yaml # Docker Compose file for orchestrating multi-container applications.
├── Dockerfile # Dockerfile to build the image for the pipeline.
├── __init__.py # Initialization script for the project pipeline package.
├── logs # Logs generated during the pipeline execution.
│ ├── dag_id=06_vertex_ai_pipeline_job_submission_with_run
│ ├── dag_id=vertex_ai_pipeline_job_submission_with_run
│ ├── dag_processor_manager
│ └── scheduler # Scheduler logs for specific dates.
│ ├── 2024-12-03
│ └── 2024-12-05
├── plugins # Custom plugins used in the pipeline.
└── tests # Unit and integration tests for various pipeline components.
├── data_collection # Tests for data collection processes.
│ ├── __init__.py # Initialization file for data collection tests.
│ ├── test_data_acquisition.py # Tests for data acquisition functionality.
│ └── test_data_sampling.py # Tests for data sampling processes.
├── data_preprocessing # Tests for data preprocessing tasks.
│ ├── __init__.py # Initialization file for data preprocessing tests.
│ ├── test_aspect_data_labeling.py # Tests for aspect-based data labeling.
│ ├── test_aspect_extraction.py # Tests for aspect extraction.
│ ├── test_data_cleaning_pandas.py # Tests for data cleaning using pandas.
│ └── test_data_labeling.py # Tests for data labeling functionality.
├── data_validation # Tests for data validation procedures.
│ └── test_schema_validation.py # Tests for schema validation of data.
└── __init__.py # Initialization file for the tests package.
Ensure you have the following installed:
- Python 3.8+
- `pip` or `conda` for managing Python packages
- Docker (for containerized pipeline execution)
- DVC (Data Version Control)
- Apache Airflow
- Google Cloud SDK (for cloud storage and pipeline integration)
- Clone the repository:

  ```bash
  git clone https://github.com/MLOps-Group-3/Amazon-Reviews-Sentiment-Analysis.git
  cd Amazon-Reviews-Sentiment-Analysis
  ```

- Install dependencies:

  ```bash
  pip install -r project_pipeline/requirements.txt
  ```

- Set up Airflow:

  - Navigate to the `project_pipeline` directory:

    ```bash
    cd project_pipeline
    ```

  - Run the following command to set up Airflow with Docker Compose:

    ```bash
    docker-compose up airflow-init
    ```

  - Once Airflow is initialized, start the Airflow services:

    ```bash
    docker-compose up
    ```

    This brings up the Airflow web server and scheduler, ready to execute your DAGs.

  - Save your GCP service account key in the `project_pipeline/config` directory as `gcp-sa-key.json`.
  - Ensure that the keys and other environment variables are correctly set in the `.env` file.
- Configure environment variables:

  In the `project_pipeline` directory, create a `.env` file and add the following environment variables:

  ```
  OPENAI_API_KEY=
  OPENAI_MODEL=
  PINECONE_API_KEY=
  PINECONE_ENVIRONMENT=us-east-1-aws
  AIRFLOW_UID=50000
  SMTP_USER=
  SMTP_PASSWORD=
  SMTP_MAIL_FROM=
  SMTP_HOST=
  SMTP_PORT=
  GCS_SERVICE_ACCOUNT_KEY=/opt/airflow/config/gcp-sa-key.json
  GCS_BUCKET_NAME=
  GCS_BUCKET_NAME_MODEL=
  ```

  Replace the placeholders with your actual credentials and configuration values:

  - `OPENAI_API_KEY` and `OPENAI_MODEL`: for connecting to the OpenAI API.
  - `PINECONE_API_KEY` and `PINECONE_ENVIRONMENT`: for connecting to Pinecone for vector database access.
  - `GCS_SERVICE_ACCOUNT_KEY`: path to your GCP service account JSON file.
  - `GCS_BUCKET_NAME`: name of the Google Cloud Storage bucket.
- Run the pipeline:

  - Trigger DAGs via the Airflow UI to execute stages of the pipeline (e.g., data acquisition, validation, preprocessing). A programmatic alternative using the Airflow REST API is sketched after these setup steps.

- Set up DVC:

  - Pull the latest data files:

    ```bash
    dvc pull
    ```

- Configure GCP:

  - Authenticate using the Google Cloud SDK:

    ```bash
    gcloud auth login
    ```

  - Set up your bucket for data storage and integration:

    ```bash
    gcloud storage buckets create gs://your-bucket-name
    ```

- Test the pipeline:

  - Run the tests for data collection, validation, and preprocessing:

    ```bash
    pytest data_pipeline/tests
    ```
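As an alternative to triggering DAGs through the Airflow UI, runs can be started over Airflow's stable REST API. The sketch below is illustrative only: the host assumes the local Docker Compose deployment, and the DAG id and credentials are placeholders rather than values taken from this repository.

```python
# trigger_dag.py -- hypothetical example of triggering a DAG run via the Airflow REST API.
import requests

AIRFLOW_URL = "http://localhost:8080"   # assumed local docker-compose deployment
DAG_ID = "data_acquisition_dag"         # placeholder: substitute an actual DAG id from dags/
AUTH = ("airflow", "airflow")           # placeholder credentials

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=AUTH,
    json={"conf": {}},                  # optional run configuration
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json()["dag_run_id"])
```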
- For detailed setup instructions for the `data_pipeline`, refer to data_pipeline/README.md.
- For setting up model pipelines and deployments, refer to the `model_pipeline` directory and its subfolders for specific configurations.
- End-to-End Pipeline Automation:
  - Modular DAGs in Apache Airflow for ingestion, validation, preprocessing, and sentiment analysis.
- DVC Integration:
  - Ensures version control for data and tracks changes across pipeline stages.
- Cloud Integration:
  - Utilizes Google Cloud Platform (GCP) for scalable data storage and model deployment.
- Sentiment Analysis Models:
  - Incorporates pre-trained models like BERT and RoBERTa for high-accuracy sentiment classification (see the example after this list).
- RAG for Summarization:
  - Implements Retrieval-Augmented Generation for aspect-wise summarization of reviews.
- Interactive Dashboards:
  - Streamlit-based dashboards for visualization of insights and sentiment trends.
- CI/CD with GitHub Actions:
  - Automated testing and deployment workflows for robust pipeline operations.
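To make the sentiment classification feature concrete, the sketch below runs a publicly available RoBERTa sentiment checkpoint over a few review texts using the Hugging Face `transformers` pipeline. The checkpoint choice is an assumption for demonstration purposes; the models actually trained and served by this project live under `model_pipeline` and `project_pipeline/dags/model_utils`.

```python
# sentiment_demo.py -- illustrative only; the checkpoint is an assumption,
# not the model trained or deployed by this repository.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",  # emits negative/neutral/positive labels
)

reviews = [
    "The headphones arrived quickly and sound amazing for the price.",
    "Packaging was fine, the product is okay, nothing special.",
    "Stopped working after two days. Very disappointed.",
]

for review, result in zip(reviews, classifier(reviews)):
    print(f"{result['label']:>8}  ({result['score']:.2f})  {review}")
```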
This repository contains Airflow DAGs for preprocessing, validating, and analyzing Amazon review data through a series of modular tasks. Each DAG corresponds to a distinct stage in the data pipeline, leveraging Python and Pandas for data transformation, sampling, validation, and sentiment analysis.
- Data Acquisition:
  - Extracts and ingests Amazon review data from the UCSD dataset.
  - Stores raw data for further processing.
- Data Sampling:
  - Samples review data across specified categories to create balanced datasets.
  - Consolidates sampled data for downstream analysis.
- Data Validation:
  - Ensures dataset quality and consistency by checking schema, null values, and integrity.
  - Logs invalid rows for further debugging and cleaning.
- Data Preprocessing:
  - Cleans data by removing duplicates, handling missing values, and normalizing text.
  - Labels reviews with sentiment tags (positive, neutral, negative).
- Sentiment Analysis and Summarization:
  - Applies pre-trained models like BERT for sentiment classification.
  - Implements RAG (Retrieval-Augmented Generation) for generating aspect-wise summaries.
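Each stage above corresponds to a DAG under `project_pipeline/dags`. The following is a deliberately simplified sketch of how such a stage can be expressed in Airflow; the DAG id, task names, and callables are illustrative and do not mirror the repository's actual DAG definitions.

```python
# example_preprocessing_dag.py -- simplified illustration, not the repository's actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def clean_reviews(**context):
    """Placeholder task: drop duplicates, handle missing values, normalize text."""
    print("Cleaning review data...")


def label_sentiment(**context):
    """Placeholder task: tag reviews as positive, neutral, or negative."""
    print("Labeling sentiment...")


with DAG(
    dag_id="example_data_preprocessing",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually or by an upstream DAG
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="clean_reviews", python_callable=clean_reviews)
    label = PythonOperator(task_id="label_sentiment", python_callable=label_sentiment)

    clean >> label
```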
- Apache Airflow:
  - Orchestrates the pipeline with modular, reusable DAGs.
- DVC:
  - Tracks data changes and versions across pipeline stages.
- MLflow:
  - Manages model experiments and tracks metrics for sentiment analysis.
- The repository uses DVC for version control of datasets and intermediate outputs. Ensure you have the latest configuration pulled:

  ```bash
  dvc pull
  ```
The `model_pipeline` directory contains workflows for training, evaluating, and deploying machine learning models for sentiment analysis and summarization.
- Sentiment Analysis Pipeline:
  - Trains state-of-the-art models like BERT and RoBERTa.
  - Performs hyperparameter tuning and bias detection to optimize performance.
- RAG Pipeline:
  - Uses a retrieval-based approach for generating aspect-wise summaries.
  - Handles large document contexts for better summarization accuracy.
- Deployment with PyTorch Serve:
  - Deploys trained models as scalable REST APIs (see the inference sketch after this list).
  - Includes endpoint configurations and predictor utilities.
- Interactive Dashboards:
  - Built with Streamlit for exploring sentiment trends and insights dynamically.
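To show how a client might call a model deployed with TorchServe, here is a minimal sketch. TorchServe serves inference on port 8080 by default at routes of the form `/predictions/<model_name>`; the host, model name, and request/response payloads below are placeholders, so consult the serving configuration in `model_pipeline` for the actual values.

```python
# predict_client.py -- hypothetical client for a TorchServe sentiment endpoint;
# the host, model name, and payload format are placeholders.
import requests

ENDPOINT = "http://localhost:8080/predictions/sentiment_model"  # assumed model name

review = {"text": "Battery life is great, but the strap feels cheap."}

response = requests.post(ENDPOINT, json=review, timeout=30)
response.raise_for_status()

# The exact response format depends on the custom handler packaged with the model;
# a sentiment handler typically returns a label and a confidence score.
print(response.json())
```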
- MLflow: Tracks model metrics, artifacts, and experiments.
- Vertex AI: Deploys trained models and monitors their performance in the cloud.
- Apache Airflow: Manages the orchestration of modular pipelines.
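For the experiment tracking mentioned above, a minimal MLflow logging pattern looks like the sketch below; the experiment name, parameters, and metric values are illustrative placeholders rather than results from this project.

```python
# mlflow_tracking_demo.py -- illustrative MLflow usage; names and values are placeholders.
import mlflow

mlflow.set_experiment("amazon-reviews-sentiment")  # assumed experiment name

with mlflow.start_run(run_name="bert-baseline"):
    # Log hyperparameters for the run.
    mlflow.log_param("model", "bert-base-uncased")
    mlflow.log_param("learning_rate", 2e-5)

    # Log evaluation metrics (placeholder values).
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_macro", 0.88)
```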