ApplianceInsight: Web Scraping, ML Label Validation, and Visualization for Energy-Efficient Appliances

Description

This project, developed as part of the 4th semester of Datamatiker/Computer Science at UCN, aims to extract information using a web scraper and validate the data through machine learning.

Overview

The project comprises several components:

Web Scraping:
- Uses Scrapy, a Python web scraping framework, to extract links to household appliance pages from the sitemaps of a predefined list of websites.
- Uses Playwright to open each link, extract product information, and capture screenshots of the relevant content.
Machine Learning Validation:
- Utilizes FastAPI and a pre-trained object detection model based on YOLO via the Ultralytics framework.
- The model processes each screenshot to check compliance with EU energy labelling rules:
  - It reports whether the screenshot shows a new EU energy label, a pre-2021 label, or no label.
- The result, together with the previously scraped data, is saved to a MongoDB database.
Database:
- FastAPI is used to access the MongoDB database with endpoints for data manipulation (POST and DELETE).
Frontend:
- An Angular-based frontend interacts with the MongoDB API to display products from specified sites.
- Features a grid-like list of products, statistics, and a pie chart showing the distribution of new, old, and unlabeled products.
- The website is in Danish to cater to local users.

Usage

Provide instructions on how to:

Set up the project environment.
Install the necessary dependencies (Python requirements files and npm install).
Run the different components of the project.

Screenshots/Demo

The dashboard shows the products from the specified sites in a grid-like list, together with statistics and a pie chart of the distribution of new, old, and unlabeled products.
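The distribution behind the pie chart can be sketched as a simple count per label status. This is a minimal Python illustration, not the project's actual code: the status strings `new_label`, `old_label`, and `no_label` and the function name `label_distribution` are assumptions, and the real frontend may compute these figures differently.

```python
from collections import Counter

# Hypothetical status values; the project's stored values may differ.
CATEGORIES = ("new_label", "old_label", "no_label")


def label_distribution(statuses):
    """Return the percentage share of each category, rounded to one decimal."""
    counts = Counter(statuses)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        # No products yet: report zero for every slice instead of dividing by zero.
        return {c: 0.0 for c in CATEGORIES}
    return {c: round(100 * counts[c] / total, 1) for c in CATEGORIES}


shares = label_distribution(["new_label", "new_label", "old_label", "no_label"])
print(shares)
```

Statuses outside the three known categories are ignored, so an unexpected value in the data does not distort the chart.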

Technologies Used

Scrapy

Purpose: Web scraping framework in Python used to extract relevant links to household appliances from a predefined list of websites.
Key Features:
- Efficiently extracts structured data from websites.
- Enables the creation of robust web crawlers.
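To illustrate the link-extraction step, the sketch below filters the URLs of a sitemap by keyword using only the Python standard library. It is an assumption-laden stand-in for what the Scrapy spider does: the keyword list, the function name `appliance_links`, and the sample URLs are hypothetical, and the project's spider may select links by other rules.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical keywords (Danish appliance terms); the real list is not known.
APPLIANCE_KEYWORDS = ("vaskemaskine", "opvaskemaskine", "koleskab")


def appliance_links(sitemap_xml, keywords=APPLIANCE_KEYWORDS):
    """Return the <loc> URLs of a sitemap that contain any of the keywords."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [u for u in urls if any(k in u.lower() for k in keywords)]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/vaskemaskine/model-a</loc></url>
  <url><loc>https://shop.example/kontakt</loc></url>
</urlset>"""

print(appliance_links(SAMPLE))
```

In a Scrapy project, the same kind of filtering would typically live in a sitemap-aware spider rather than a standalone function.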

Playwright

Purpose: Headless browser automation library used alongside Scrapy to open links, extract information, and capture screenshots of relevant content.
Key Features:
- Provides cross-browser compatibility for web automation.
- Allows interaction with web pages programmatically.

FastAPI

Purpose: Python-based web framework utilized to create APIs for interacting with the machine learning validation process and the MongoDB database.
Key Features:
- High performance and asynchronous support.
- Simplified and easy-to-use API development.

Ultralytics (YOLO model)

Purpose: Pre-trained object detection model based on the YOLO (You Only Look Once) architecture employed to validate screenshots.
Key Features:
- Efficient real-time object detection.
- Flexibility and accuracy in identifying objects within images.
- Little code required to implement (only a few lines of code).
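The three-way outcome (new label, pre-2021 label, no label) can be derived from the model's detections with a small decision function. The sketch below assumes the detections have already been read out of the model's results as `(class_name, confidence)` pairs. The class names `new_label` and `old_label`, the 0.5 threshold, and the function name `classify_screenshot` are hypothetical; the trained model's actual classes and the project's thresholds may differ.

```python
# Hypothetical class names produced by the detection model.
NEW_LABEL = "new_label"
OLD_LABEL = "old_label"
NO_LABEL = "no_label"


def classify_screenshot(detections, min_conf=0.5):
    """Map (class_name, confidence) pairs to one of the three outcomes.

    The most confident known detection at or above min_conf wins;
    if none qualifies, the screenshot is treated as having no label.
    """
    candidates = [
        (conf, name)
        for name, conf in detections
        if name in (NEW_LABEL, OLD_LABEL) and conf >= min_conf
    ]
    if not candidates:
        return NO_LABEL
    # max() compares confidence first, so the highest-confidence class is chosen.
    return max(candidates)[1]


print(classify_screenshot([("new_label", 0.91), ("old_label", 0.62)]))
print(classify_screenshot([("old_label", 0.30)]))
```

Taking only the single most confident detection keeps the result unambiguous when a page shows several label-like images.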

MongoDB

Purpose: NoSQL database used to store extracted data, including information from appliances and their respective energy labels.
Key Features:
- Document-oriented database for flexibility in storing unstructured data.
- Scalability and ease of integration with Python.
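As an illustration of the document-oriented storage, the sketch below shows one possible shape for a stored product. All field names, the `ProductRecord` class, and the sample values are assumptions made for the example; the project's actual schema is not documented here.

```python
from dataclasses import dataclass, asdict


@dataclass
class ProductRecord:
    """Hypothetical shape of one scraped product (field names are assumptions)."""
    url: str
    site: str
    name: str
    label_status: str  # e.g. "new_label", "old_label", or "no_label"
    scraped_at: str    # ISO 8601 timestamp


def to_document(record):
    """Convert a record to a plain dict, the form a MongoDB driver stores."""
    return asdict(record)


doc = to_document(
    ProductRecord(
        url="https://shop.example/vaskemaskine/model-a",
        site="shop.example",
        name="Model A",
        label_status="new_label",
        scraped_at="2024-01-01T12:00:00+00:00",
    )
)
print(doc)
```

With PyMongo, a dict like this could then be passed to a collection's `insert_one` method.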

Angular

Purpose: Frontend framework used to build the user interface that interacts with the MongoDB API to display product information.
Key Features:
- Component-based architecture for building dynamic web applications.
- Two-way data binding and dependency injection for efficient development.

Contributors

The project was developed by a group of 4 students:

Christian
Oliver
Mads
Lucas

Additional Notes

The project is not complete, and there are several areas that could be improved upon:

The web scraper could be improved to extract more information from the websites.
The machine learning validation could be improved to identify more information from the screenshots, and expanded to the full energy labels.
The frontend could be improved to display more information from the database.
