The module extracts text from images using the Tesseract OCR engine. Text in images is often blurry or unevenly sized, so the image is pre-processed to make it easier for the OCR engine to read. The module first draws a bounding box around the text in the image and then normalizes it to 300 dpi, a resolution suitable for the OCR engine.

yardstick17/image_text_reader

image_text_reader

It's a very basic tool for reading text from images, in particular images formatted like a restaurant menu.

Tesseract-ocr

This tool needs the tesseract-ocr engine. Installation instructions follow.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to find it. For example, you can install Tesseract 4.x and its developer tools on Ubuntu 18.x (bionic) by running:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

Refer to the Tesseract installation documentation for all other systems.
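After installing, it can help to confirm that the engine is reachable before running the tool. The following is a minimal sketch, not part of this repository: the function name `tesseract_version` is hypothetical, and it assumes the binary is called `tesseract` and is on your PATH.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `<binary> --version`, or None if not found."""
    path = shutil.which(binary)  # look the executable up on PATH
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract was not found on PATH")
```

If this reports that the binary was not found, revisit the installation steps for your platform.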

macOS

Homebrew

To install Tesseract run this command:

brew install tesseract

The OCR engine reads the text region extracted from the full image.
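To make the idea of extracting a text region and normalizing it to 300 dpi concrete, here is a dependency-free sketch. It is an illustration only and not the repository's code: the helper names `text_bounding_box` and `scaled_size` are hypothetical, the image is modelled as nested lists of grayscale values, and a simple darkness threshold stands in for whatever detection the module actually performs.

```python
from typing import List, Optional, Tuple

TARGET_DPI = 300  # resolution named in the module description


def text_bounding_box(
    pixels: List[List[int]], threshold: int = 128
) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) enclosing all pixels darker than threshold.

    right and bottom are exclusive. Returns None when no pixel is dark enough.
    """
    rows = [y for y, row in enumerate(pixels) if any(v < threshold for v in row)]
    cols = [x for row in pixels for x, v in enumerate(row) if v < threshold]
    if not rows:
        return None
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1


def scaled_size(width: int, height: int, source_dpi: float) -> Tuple[int, int]:
    """Pixel size that covers the same physical area at TARGET_DPI."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    factor = TARGET_DPI / source_dpi
    return round(width * factor), round(height * factor)


if __name__ == "__main__":
    page = [
        [255, 255, 255, 255, 255],
        [255, 0, 0, 255, 255],
        [255, 255, 10, 255, 255],
        [255, 255, 255, 255, 255],
    ]
    print(text_bounding_box(page))     # (1, 1, 3, 3)
    print(scaled_size(100, 50, 150))   # (200, 100)
```

A real pipeline would crop the image to the box and resample it to the computed size with an imaging library before passing it to the OCR engine.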

Commands to use:

Dockerized image reading

docker run -it yardstick17/image-text-reader bash -c "PYTHONPATH='.' python3 read_image.py read_text_from_local_image -f images/sample_image.jpg"

Read from URL

PYTHONPATH='.' python3 read_image.py read_text_from_image_url -u https://marketplace.canva.com/MACHUlPU93Q/1/0/thumbnail_large/canva-peach-green-leaves-garden-vegetarian-pizza-menu-MACHUlPU93Q.jpg

[2017-07-07 16:20:34,119] INFO : Downloading image from url: https://marketplace.canva.com/MACHUlPU93Q/1/0/thumbnail_large/canva-peach-green-leaves-garden-vegeta
[2017-07-07 16:20:35,997] INFO : Saving file: /var/folders/cz/n3vkz7x91qs06nmm9byxxgz00000gr/T/tmpienrxu2c
[2017-07-07 16:20:35,997] INFO : Processing image for text Extraction
[2017-07-07 16:20:36,308] INFO : Removing noise and smoothening image
[2017-07-07 16:20:36,431] INFO : Reading the text inside the contour plotted
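The first two log lines show the image being downloaded and saved to a temporary file before processing. A minimal sketch of that step using only the standard library might look like the following. The function name `download_to_temp_file` and the timeout are assumptions for illustration; the repository's own implementation may differ.

```python
import logging
import shutil
import tempfile
import urllib.request

logger = logging.getLogger(__name__)


def download_to_temp_file(url: str, timeout: float = 30.0) -> str:
    """Download url into a new temporary file and return that file's path."""
    logger.info("Downloading image from url: %s", url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # delete=False keeps the file on disk after the handle is closed,
        # so a later processing step can open it by path.
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            path = tmp.name
    logger.info("Saving file: %s", path)
    return path
```

The caller is responsible for deleting the temporary file once the text has been extracted.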

Read from local image

PYTHONPATH='.' python3 read_image.py read_text_from_local_image -f images/sample_image.jpg

[2017-07-07 16:32:38,862] INFO : Processing image for text Extraction
[2017-07-07 16:32:39,232] INFO : Removing noise and smoothening image
[2017-07-07 16:32:39,442] INFO : Reading the text inside the contour plotted

Deploy an API for reading text from images

PYTHONPATH='.' python3 api/app.py

[2017-07-07 16:49:57,818] INFO :  * Running on http://0.0.0.0:6600/ (Press CTRL+C to quit)
[2017-07-07 16:49:57,820] INFO :  * Restarting with stat
[2017-07-07 16:49:58,712] WARNING :  * Debugger is active!
[2017-07-07 16:49:58,738] INFO :  * Debugger pin code: 316-405-633

A sample API is deployed on my tiny server. Please be patient with it.

curl -X POST \
  http://54.254.214.96/read_image_from_file/url \
  -F url=https://africatalentbank.com/wp-content/uploads/2014/10/Menu.jpg
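The same request can be built from Python. The sketch below mirrors the curl call by sending a single multipart form field named `url`; the helper name `build_read_image_request` is hypothetical, and the response format is not documented here, so the example only constructs the request and leaves interpreting the reply to you.

```python
import urllib.request
import uuid


def build_read_image_request(api_url: str, image_url: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying one form field named 'url'."""
    boundary = uuid.uuid4().hex  # random delimiter between multipart sections
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="url"\r\n'
        "\r\n"
        f"{image_url}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(api_url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_read_image_request(
        "http://54.254.214.96/read_image_from_file/url",
        "https://africatalentbank.com/wp-content/uploads/2014/10/Menu.jpg",
    )
    # The sample server may be slow or unavailable, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=60) as response:
        print(response.read().decode("utf-8", errors="replace"))
```

To target a locally deployed instance instead, pass your own host and port as `api_url`.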

Digital Menu

[Figure: Digital Image]

Original Image

[Figure: Original Image]
