fluentnumbers/mlops_pipeline_fake_news

Hello👋, I am Andrejs

LinkedIn Medium Github

I am a passionate data scientist and ML engineer on a continuous journey of learning by doing, and this is my MLOps portfolio project for the MLOps Zoomcamp course.

Fake news detection

Problem description

The expansion of information outlets in the digital era is akin to a two-sided coin. On one hand, it has equalized the distribution of knowledge and news, but on the other, it has facilitated the dissemination of misinformation and spurious news. Such misleading information has the potential to warp public conversation, sway personal convictions, and potentially manipulate the results of elections and public health policies.

Importance

Given these implications, robust and scalable methods for distinguishing authentic news from misinformation are urgently needed. Machine learning is well suited to this task.

Project goals

The goal of this project is two-fold. The apparent goal is to build a machine learning model that can efficiently categorize news articles as genuine or fake based on their headline and content. We use natural language processing methods and machine learning algorithms, and the model classifies each article from its textual characteristics.

The actual learning goal, following the MLOps Zoomcamp, is to create an example or template repository that employs up-to-date MLOps and data management practices. We therefore use the first goal as the backbone of a comprehensive ML engineering workflow that spans every stage of the MLOps process. See Solution architecture for more information.
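To make the first goal more concrete, here is a minimal sketch of how a headline and article body could be turned into a fixed-length integer sequence, the usual input for a sequence model. It is an illustration under assumptions, not the preprocessing used in this repository: the function names, the `PAD`/`UNK` ids, the vocabulary size and `max_len` are all hypothetical.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids: padding and out-of-vocabulary tokens


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts: list[str], max_size: int = 10_000) -> dict[str, int]:
    """Map the most frequent tokens to integer ids, starting after PAD and UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = counts.most_common(max_size - 2)
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}


def encode(headline: str, content: str, vocab: dict[str, int],
           max_len: int = 16) -> list[int]:
    """Join headline and content, map tokens to ids, then truncate or pad."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(f"{headline} {content}")]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

For example, `encode(headline, content, build_vocab(training_texts))` yields a list of `max_len` integers that an embedding layer can consume. Unseen words map to `UNK`, so the encoder does not fail on new articles.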

Fast-track run

In short, to replicate the project one needs to:

  1. Fulfill the Pre-requisites
  2. Set up the Infrastructure
  3. Set up Orchestration and Experiment tracking
  4. Deploy Prefect flows
  5. Run model training at least once
  6. Deploy the best model as a web service, either locally or using Cloud Run
  7. Set up Monitoring
  8. Apply the Best practices

For a full understanding, refer to Solution architecture, Project organization, the full list of project components, and the overall Project progress.

Project progress

For a self-evaluation of project completion against the Zoomcamp criteria, see #Zoomcamp-criteria-self-evaluation. For an overview of implemented features and TODOs, see PROJECT_PROGRESS.md.

Datasets

As noted by the community, the data collection methodology of this dataset is questionable, which is probably why accuracy scores close to 100% are attainable.

This does not interfere with the main goals of the project, which focuses on MLOps and good software practices for ML tasks. This particular NLP application is just an example. Adding more datasets for training and validation is on the project TODO list.

Solution architecture

[Figure: solution architecture diagram]

Project organization

  • ==Generate tree and describe each folder\file==

Project components and reproducibility

| Area | Description |
|---|---|
| Problem description | Explains the project goals, motivation and general outline |
| Infrastructure | Shows how to set up the GCP project, cloud resources, venv, VM tooling, etc. |
| Orchestration | Controlled and scheduled execution of flows and tasks using Prefect |
| Training | Train an LSTM model for text classification |
| Deployment | Deploy the model as a service for inference |
| Monitoring | |
| Best practices | Unit tests, integration tests, auto-formatting, etc. |
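As an illustration of the Deployment component, the sketch below shows the request-handling logic an inference service might contain. It rests on several assumptions that do not come from this repository: the JSON fields `title` and `text`, the 0.5 decision threshold, and `stub_score`, which is a placeholder standing in for the trained model.

```python
import json
from typing import Callable


def stub_score(text: str) -> float:
    """Placeholder for the trained model: returns a fake-news probability."""
    # Hypothetical heuristic so the sketch runs without a model artifact.
    return 0.9 if "shocking" in text.lower() else 0.1


def handle_request(body: bytes,
                   score: Callable[[str], float] = stub_score,
                   threshold: float = 0.5) -> tuple[int, dict]:
    """Validate a JSON payload and return (HTTP status code, response dict)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(payload, dict) or not isinstance(payload.get("text"), str):
        return 400, {"error": "field 'text' (string) is required"}
    # The headline is optional; it is prepended to the body before scoring.
    title = payload.get("title", "")
    probability = score(f"{title} {payload['text']}")
    label = "fake" if probability >= threshold else "genuine"
    return 200, {"probability": probability, "label": label}
```

Keeping validation and scoring in a plain function makes the logic easy to unit-test. A web framework route, whether run locally or on Cloud Run, would then only need to pass the request body in and serialize the returned dictionary.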
