Skip to content

A fine-tuned language model that transforms raw OCR text from receipts and invoices into structured JSON, extracting key information.

Notifications You must be signed in to change notification settings

Sayanabha/Billify

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

Billify (Receipt Radar)

Billify is an advanced solution designed to convert raw OCR text from receipts and invoices into structured JSON format. By leveraging the power of OCR technology and a fine-tuned large language model (LLM), Billify automates data extraction and processing from physical documents, offering high efficiency and accuracy.

Table of Contents

Introduction

Billify automates the process of extracting structured data from receipts and invoices. This project utilizes PaddleOCR for text extraction and a fine-tuned LLM from Hugging Face Transformers for converting raw OCR text into structured JSON format.

Features

  • Image Input: Supports common image formats (JPEG, PNG, PDF) and batch processing.
  • OCR Text Extraction: Uses PaddleOCR to handle various layouts and fonts in receipts and invoices.
  • Structured JSON Output: Extracts key information such as store name, date, items, quantities, prices, and total amount.
  • Error Handling: Robust error management for unclear or erroneous OCR results.
  • Scalability: Designed to handle large volumes of images efficiently.
  • Security: Ensures data privacy and compliance with data protection regulations.

Installation

To set up Billify, follow these steps:

  1. Clone the repository:

    git clone https://github.com/yourusername/Billify.git
    cd Billify
  2. Install required libraries:

    • Install Hugging Face Transformers:
      pip install git+https://github.com/huggingface/transformers.git
      pip install accelerate
    • Install PaddleOCR:
      git clone https://github.com/PaddlePaddle/PaddleOCR.git
      cd PaddleOCR
      python3 -m pip install paddlepaddle-gpu
      pip install "paddleocr>=2.0.1"

Usage

  1. Import necessary libraries:

    import torch
    from paddleocr import PaddleOCR
    from transformers import pipeline
  2. Initialize PaddleOCR:

    ocr = PaddleOCR(use_angle_cls=True, lang='en')
  3. Extract OCR text from an image:

    img_path = 'path_to_receipt_image.jpg'
    ocr_result = ocr.ocr(img_path, cls=True)
    raw_text = " ".join([line[1][0] for line in ocr_result[0]])
  4. Convert raw text to structured JSON using the fine-tuned LLM:

    model = pipeline('text2json', model='zephyr-7b-alpha')
    def convert_to_json(raw_text):
        structured_data = model(raw_text)
        return structured_data
    structured_json = convert_to_json(raw_text)

Architecture

The system is organized into the following modules:

  1. Image Input Module: Handles the upload and storage of image files.
  2. OCR Processing Module: Uses PaddleOCR to extract text from images.
  3. Text-to-JSON Conversion Module: Employs a fine-tuned LLM to convert raw OCR text into structured JSON format.
  4. Output Module: Provides the structured JSON data to the user or integrates it with existing business systems.

Technical Details

Languages and Libraries

  • Languages: Python
  • Libraries:
    • OCR: PaddleOCR
    • LLM: Hugging Face Transformers
  • Hardware: GPUs for efficient processing

Performance and Scalability

  • Performance: The system is optimized for minimal latency and high throughput.
  • Scalability: Designed to handle large volumes of images efficiently.

Security

  • Ensures data privacy and security during processing and storage.
  • Compliance with relevant data protection regulations is maintained.

Business Feasibility

Market Analysis

  • Demand: Increasing need for automation in data entry and processing within various industries such as retail, finance, and logistics.
  • Growth Potential: Significant market opportunity due to businesses seeking ways to reduce operational costs and improve efficiency.

Competitive Advantage

  • Accuracy: High accuracy in data extraction by combining state-of-the-art OCR and LLM technology.
  • Efficiency: Quick processing of large volumes of receipts and invoices.
  • Flexibility: Adaptable to various receipt and invoice formats.

Cost-Benefit Analysis

  • Implementation Costs: Investment in development, hardware (GPUs), and ongoing maintenance.
  • Operational Savings: Significant reduction in manual data entry costs and time.
  • ROI: High return on investment due to improved efficiency and accuracy.

Risk Analysis

  • Technical Risks: Potential challenges in handling diverse receipt formats and maintaining accuracy.
  • Mitigation Strategies: Continuous model training and updates, extensive testing on diverse datasets.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Billify is a powerful tool designed to automate and streamline the extraction of structured data from receipts and invoices, enhancing operational efficiency and accuracy.

About

A fine-tuned language model that transforms raw OCR text from receipts and invoices into structured JSON, extracting key information.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published