ChristianHolm1/hvidevare_scanning


ApplianceInsight: Web Scraping, ML Label Validation, and Visualization for Energy-Efficient Appliances

Description

This project, developed as part of the 4th semester of Datamatiker/Computer Science at UCN, extracts information about household appliances with a web scraper and validates the collected data, in particular the energy labels, through machine learning.

Overview

The project comprises several components:

  • Web Scraping:

    • Uses Scrapy, a Python framework, to extract links to household appliance pages from the sitemaps of a predefined list of websites.
    • Uses Playwright to open each link, extract product information, and capture screenshots of the relevant content.
  • Machine Learning Validation:

    • Uses FastAPI together with a pre-trained YOLO-based object detection model, loaded through the Ultralytics framework.
    • Screenshots are processed by the model to determine adherence to EU energy label laws:
      • Each screenshot is classified as showing a new EU energy label, a pre-2021 label, or no label.
    • The result, along with the previously collected data, is saved to a MongoDB database.
  • Database:

    • FastAPI is used to access the MongoDB database with endpoints for data manipulation (POST and DELETE).
  • Frontend:

    • An Angular-based frontend interacts with the MongoDB API to display products from specified sites.
    • Features a grid-like list of products, statistics, and a pie chart showing the distribution of new, old, and unlabeled products.
    • The website is in Danish to cater to local users.
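
The validation step and the pie chart described above boil down to mapping the model's detections to one of three statuses and then counting them. The following is a minimal sketch of that logic under stated assumptions, not the project's actual code: the class names `new_label` and `old_label`, the `Detection` type, and the 0.5 confidence threshold are all hypothetical, since the README does not list the model's classes or thresholds.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Detection(NamedTuple):
    """One detected object: a class name and the model's confidence (0..1)."""
    label: str
    confidence: float


# Hypothetical class names; the real model's classes are not given in this README.
NEW_LABEL = "new_label"
OLD_LABEL = "old_label"


def classify_screenshot(detections: Iterable[Detection], threshold: float = 0.5) -> str:
    """Return 'new', 'old', or 'none' for a single screenshot."""
    # Keep only known label classes that meet the (assumed) confidence threshold.
    kept = [
        d for d in detections
        if d.confidence >= threshold and d.label in (NEW_LABEL, OLD_LABEL)
    ]
    if not kept:
        return "none"
    # If both label types are detected, trust the most confident detection.
    best = max(kept, key=lambda d: d.confidence)
    return "new" if best.label == NEW_LABEL else "old"


def label_distribution(statuses: Iterable[str]) -> Dict[str, int]:
    """Count statuses in a fixed order, as a pie chart would need them."""
    counts = Counter(statuses)
    return {key: counts.get(key, 0) for key in ("new", "old", "none")}
```

In a setup like the one described, the detections would come from the object detection model and the resulting status would be stored with the product document, so the frontend can compute the distribution from the database.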

Usage

Provide instructions on how to:

  • Set up the project environment.
  • Install the necessary dependencies (Python requirements files and npm install).
  • Run the different components of the project.

Screenshots/Demo

The dashboard displays products from specified sites, along with statistics and a pie chart showing the distribution of new, old, and unlabeled products.

Screenshot 1

The website features a grid-like list of products, statistics, and a pie chart showing the distribution of new, old, and unlabeled products.

Screenshot 2

Technologies Used

Scrapy

  • Purpose: Web scraping framework in Python used to extract relevant links to household appliances from a predefined list of websites.
  • Key Features:
    • Efficiently extracts structured data from websites.
    • Enables the creation of robust web crawlers.

Playwright

  • Purpose: Headless browser automation library used alongside Scrapy to open links, extract information, and capture screenshots of relevant content.
  • Key Features:
    • Provides cross-browser compatibility for web automation.
    • Allows interaction with web pages programmatically.

FastAPI

  • Purpose: Python-based web framework used to create APIs for interacting with the machine learning validation process and the MongoDB database.
  • Key Features:
    • High performance and asynchronous support.
    • Simplified and easy-to-use API development.

Ultralytics YOLO

  • Purpose: Pre-trained object detection model based on the YOLO (You Only Look Once) architecture, used to validate screenshots.
  • Key Features:
    • Efficient real-time object detection.
    • Flexibility and accuracy in identifying objects within images.
    • Little code required to implement (as little as one line of code).

MongoDB

  • Purpose: NoSQL database used to store extracted data, including information about appliances and their respective energy labels.
  • Key Features:
    • Document-oriented database for flexibility in storing unstructured data.
    • Scalability and ease of integration with Python.

Angular

  • Purpose: Frontend framework used to build the user interface that interacts with the MongoDB API to display product information.
  • Key Features:
    • Component-based architecture for building dynamic web applications.
    • Two-way data binding and dependency injection for efficient development.
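
To illustrate the link-extraction step that the scraping framework performs, here is a small standard-library sketch that reads a sitemap and keeps only URLs that look like appliance pages. It is a stand-in for the idea, not the project's Scrapy spider: the function name `appliance_links` and the keyword list are hypothetical, and the real project may filter links differently.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical keywords (Danish appliance terms); not taken from the project.
APPLIANCE_KEYWORDS = ("vaskemaskine", "koeleskab", "opvaskemaskine")


def appliance_links(sitemap_xml: str,
                    keywords: Iterable[str] = APPLIANCE_KEYWORDS) -> List[str]:
    """Return the <loc> URLs from a sitemap that contain any keyword."""
    root = ET.fromstring(sitemap_xml)
    lowered = tuple(k.lower() for k in keywords)
    links = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        url = (loc.text or "").strip()
        # Case-insensitive substring match against the URL.
        if url and any(k in url.lower() for k in lowered):
            links.append(url)
    return links
```

The returned links are what a browser automation step would then open to extract product details and capture screenshots.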

Contributors

The project was developed by a group of 4 students:

Additional Notes

The project is not complete, and there are several areas that could be improved upon:

  • The web scraper could be improved to extract more information from the websites.
  • The machine learning validation could be improved to identify more information from the screenshots, and expanded to the full energy labels.
  • The frontend could be improved to display more information from the database.
