Single Grapheme Prediction based model, Bangla HOCR

mnansary/bnhocr

bnhocr

Optical Character Recognition for Bangla handwritten and printed documents

Single Grapheme Prediction based model, Bangla HOCR

   version:0.0.1
   authors:MD.Nazmuddoha Ansary, (team ovijatrik,apsis solutions ltd,bengali.ai)
           MD.Mobassir Hossain,  (team ovijatrik,apsis solutions ltd,bengali.ai) 
           MD.Aminul Islam       (team ovijatrik)

This project was created in association with:

Environment

  • For Ubuntu, install the Tesseract Bangla language data: install the tesseract-ocr-ben package
  • For Windows (untested source):
    • Download and install tesseract-ocr-w64-setup-v5.0.0-rc1.20211030.exe (or the latest one)
    • Open https://github.com/tesseract-ocr/tessdata and download the data file for your language. For example, for Bangla download ben.traineddata.
    • Copy the downloaded file to the tessdata folder of the Tesseract-OCR installation, typically a location like: C:\Program Files\Tesseract-OCR\tessdata
    • Use the traineddata file name (without the extension) as the language code. For Bangla, use lang='ben'.
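
To check the setup described above, the following is a minimal sketch that only uses the Python standard library and the tesseract command line program. The function names (has_traineddata, build_tesseract_command, recognize) are illustrative and are not part of this repository; the sketch assumes the tesseract executable is on the PATH.

```python
import subprocess
from pathlib import Path


def has_traineddata(tessdata_dir, lang="ben"):
    """Return True if <lang>.traineddata is present in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def build_tesseract_command(image_path, lang="ben"):
    """Build the CLI call `tesseract <image> stdout -l <lang>`."""
    # "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def recognize(image_path, lang="ben"):
    """Run Tesseract on an image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8")
```

For example, has_traineddata(r"C:\Program Files\Tesseract-OCR\tessdata") should return True once ben.traineddata has been copied into place, and recognize("sample.png") (a hypothetical image file) returns whatever text Tesseract reads from it.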

python requirements

  • pip requirements: pip install -r requirements.txt

It's better to use a virtual environment, since some of the pip requirements may not work properly due to locally saved modules. Alternatively, use conda:

  • Preferred way (conda): create the environment from environment.yml: conda env create -f environment.yml

model requirements

  • Download model.h5 file
  • place the model.h5 file under models folder
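
A quick way to confirm the model file is where it is expected is sketched below. This is only an illustration: find_model and load_model are hypothetical helper names, and loading through Keras is an assumption based on the .h5 extension (the file may instead hold weights only, or need custom layers, in which case load_model would have to change).

```python
from pathlib import Path


def find_model(models_dir="models", filename="model.h5"):
    """Return the path to the model file, or raise if it is missing."""
    path = Path(models_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the model at {path}; download model.h5 and place it there."
        )
    return path


def load_model(models_dir="models"):
    """Load the model with Keras (assumption: model.h5 is a full Keras model)."""
    # Imported lazily so that find_model works without TensorFlow installed.
    from tensorflow import keras

    # compile=False skips restoring the optimizer, which inference does not need.
    return keras.models.load_model(str(find_model(models_dir)), compile=False)
```

Calling find_model() from the repository root raises FileNotFoundError with a readable message if the download step was skipped.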

LOCAL ENVIRONMENT/TESTING ENVIRONMENT

OS          : Ubuntu 20.04.3 LTS       
Memory      : 23.4 GiB 
Processor   : Intel® Core i5-8250U CPU @ 1.60GHz × 8    
Graphics    : Intel® UHD Graphics 620 (Kabylake GT2)  
Gnome       : 3.36.8

About bnhocr: Printed and Handwritten text recognition

Existing models such as Tesseract and EasyOCR already cover printed Bangla text with considerable accuracy. In this project we focus solely on handwritten text.

  • The idea of the bnhocr project is to convert handwritten graphemes to a unique representational space (in our example, a font-faced image).
  • Converting single graphemes does not cover words. Separated graphemes can be handled if a grapheme localization model is used. Watch this video for the basic idea.

  • To extend to words, we build on the single grapheme transformation model and extend it to handwritten words.
  • Then we use an existing recognizer (in our example, Tesseract).

  • A short presentation about this work is available at resources/slides.pdf
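
The two-stage idea described above (transform the handwriting into a font-faced image, then hand that image to an existing recognizer) can be sketched as a small composition of two functions. This is not the code used in this repository: make_recognizer, transform and recognize are hypothetical names, and the image type is left open on purpose.

```python
from typing import Any, Callable

Image = Any  # placeholder: a NumPy array, a PIL image, or similar


def make_recognizer(
    transform: Callable[[Image], Image],
    recognize: Callable[[Image], str],
) -> Callable[[Image], str]:
    """Chain a handwriting-to-font transformation with an existing recognizer."""

    def run(handwritten_word: Image) -> str:
        # Stage 1: map the handwritten input to a font-faced image.
        font_faced = transform(handwritten_word)
        # Stage 2: let an off-the-shelf recognizer read the cleaner image.
        return recognize(font_faced).strip()

    return run
```

Keeping the recognizer as a plain callable means a different OCR engine can be swapped in without touching the transformation step.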

Demo

  • clone the repo
  • install dependencies
  • streamlit run app.py

Graphemes

@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}

Known Issues

  • The model is not cached while running in Streamlit
  • Only Tesseract is supported as the recognizer so far
    • EasyOCR and detection models can easily be added for wider applications
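
One possible way to address the caching issue is to memoize the model loader so the file is read only once per process. The sketch below uses functools.lru_cache from the standard library; _load_model_from_disk is a hypothetical stand-in for the real loading call, not a function from this repository. Streamlit also ships its own caching decorators (st.cache_resource in recent releases), which may be a better fit inside app.py depending on the installed version.

```python
from functools import lru_cache


def _load_model_from_disk(path):
    """Hypothetical stand-in for the real (slow) model loading call."""
    return {"path": path}


@lru_cache(maxsize=1)
def get_model(path="models/model.h5"):
    """Load the model on the first call and reuse it on later calls."""
    return _load_model_from_disk(path)
```

Every later call to get_model() with the same path returns the same object, so reruns of the script within one process no longer reload the file.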