Hello👋, I am Andrejs
I am a passionate data scientist and ML engineer on a continuous journey of learning by doing, and this is my MLOps portfolio project for the MLOps Zoomcamp course.
The expansion of information outlets in the digital era is a double-edged sword. On one hand, it has equalized the distribution of knowledge and news; on the other, it has facilitated the spread of misinformation and fake news. Such misleading information can warp public conversation, sway personal convictions, and even influence the results of elections and public health policies.
Given these implications, there is an urgent need for robust and scalable methods to distinguish authentic news from misinformation. This is an area where machine learning can be particularly useful.
The goal of this project is two-fold. The apparent goal is to build a machine learning model that can efficiently classify news articles as genuine or fake based on their content and headline. We use natural language processing methods and machine learning algorithms, so the model relies on the textual characteristics of each article for its classification.
But the actual learning goal of the MLOps Zoomcamp is to create an example or template repository that employs the most up-to-date MLOps and data management practices. Therefore, the first goal serves as the backbone of a comprehensive ML engineering workflow that spans every stage of the MLOps process. See Solution architecture for more information.
In short, to replicate the project one needs to:
- Fulfill the Pre-requisites
- Set up the Infrastructure
- Set up Orchestration and Experiment tracking
- Deploy Prefect flows
- Run model training at least once
- Deploy the best model as a web service either locally or using Cloud Run
- Monitoring
- Best practices
For a full understanding, please refer to the Solution architecture, the Project organization, the full list of project components, and the overall Project progress.
For the project's self-evaluation against the Zoomcamp criteria, see #Zoomcamp-criteria-self-evaluation. For an overview of implemented features and TODOs, see PROJECT_PROGRESS.md.
- This project is primarily based on Fake and real news dataset @ Kaggle
As noted by the community, the data collection methodology of this dataset is questionable, which is probably why it is possible to reach very high accuracy scores close to 100%.
However, this does not undermine the main goals of this project, which focuses on MLOps and software best practices for ML tasks. This particular NLP application is just an example. Adding more datasets for training and validation is on the project TODO list.
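As a rough illustration of how the headline and content of each article could be combined into labeled examples, here is a minimal sketch. It assumes the data arrives as two CSV sources (one with fake and one with real articles), each with `title` and `text` columns. The function name `read_articles`, the label encoding, and the sample rows are hypothetical and are not taken from this repository.

```python
import csv
import io
from typing import Iterable, List, Tuple

# Hypothetical label encoding, chosen only for this sketch.
FAKE, REAL = 1, 0


def read_articles(handle: Iterable[str], label: int) -> List[Tuple[str, int]]:
    """Read CSV rows and return (headline + content, label) pairs."""
    reader = csv.DictReader(handle)
    examples = []
    for row in reader:
        # Join headline and body so a model sees both as one text input.
        combined = f"{row['title'].strip()} {row['text'].strip()}".strip()
        if combined:  # skip rows with neither headline nor content
            examples.append((combined, label))
    return examples


# Tiny in-memory stand-ins for the two CSV files (made-up rows).
fake_csv = io.StringIO("title,text\nAliens endorse candidate,Sources say so.\n")
true_csv = io.StringIO("title,text\nCity council meets,The budget was approved.\n")

dataset = read_articles(fake_csv, FAKE) + read_articles(true_csv, REAL)
```

With real files, the `io.StringIO` objects would be replaced by file handles opened with `open(path, newline="", encoding="utf-8")`.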
- ==Generate tree and describe each folder\file==
Area | Description |
---|---|
Problem description | Explains the project goals, motivation and general outline |
Infrastructure | Shows how to set up the GCP project, cloud resources, venv, VM tooling, etc. |
Orchestration | Controlled and scheduled execution of flows and tasks using Prefect |
Training | Train an LSTM model for text classification |
Deployment | Deploy the model as a service for inference |
Monitoring | |
Best practices | Unit tests, integration tests, auto-formatting, etc. |