
[TrOCR] How to run inference on multiline text image #628

Closed
mariababich opened this issue Feb 21, 2022 · 11 comments
@mariababich

Hello!

I am wondering how to run TrOCR on a whole image containing a lot of text. The tutorials show how the model works on single-line images. When I tried to run it on an image with a lot of text, it did not work. How can the inference be scaled?

Thanks in advance, Mariia.

@wolfshow
Contributor

@mariababich TrOCR is designed for single-line text recognition. You need to use a text detector first to get the text lines.

@NielsRogge

Yes, you can combine TrOCR with CRAFT for instance:

  • CRAFT can handle the text detection
  • TrOCR can handle the text recognition.
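A minimal sketch of that two-stage pipeline, assuming `craft-text-detector` and the `microsoft/trocr-base-handwritten` checkpoint; the `poly_to_bbox` helper and file paths are illustrative, and the model-loading part is left as comments so the sketch stays self-contained:

```python
# Sketch: CRAFT handles detection, TrOCR handles recognition per detected line.
import numpy as np

def poly_to_bbox(poly):
    # CRAFT returns one polygon per detected text line; take its axis-aligned box
    pts = np.asarray(poly)
    return (int(pts[:, 0].min()), int(pts[:, 1].min()),
            int(pts[:, 0].max()), int(pts[:, 1].max()))

# The heavy part (requires craft-text-detector, transformers, pillow):
# from craft_text_detector import Craft
# from transformers import TrOCRProcessor, VisionEncoderDecoderModel
# from PIL import Image
#
# craft = Craft(crop_type="box", cuda=False)
# result = craft.detect_text("page.png")          # result["boxes"]: one polygon per line
# processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
# model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
# image = Image.open("page.png").convert("RGB")
# for poly in result["boxes"]:
#     line = image.crop(poly_to_bbox(poly))
#     pixel_values = processor(line, return_tensors="pt").pixel_values
#     print(processor.batch_decode(model.generate(pixel_values),
#                                  skip_special_tokens=True)[0])

bbox = poly_to_bbox([[10, 20], [110, 22], [112, 48], [9, 50]])
```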

@nyck33

nyck33 commented Jul 15, 2023

@NielsRogge I just tried to use CRAFT, but it requires torch < 1.0, which makes it impossible to use? So Bard recommended PaddleOCR. Please let me know what you think. My final goal is exactly this: OCR on multiline text, but my inputs are handwritten homework assignments for school kids.

@NielsRogge

Hi @nyck33, you can try https://github.com/fcakyon/craft-text-detector, which is a packaged and more up-to-date version of CRAFT.

@nyck33

nyck33 commented Jul 15, 2023

@NielsRogge thanks! It does look more up-to-date, but I was getting the model_urls error, so I referenced clovaai/CRAFT-pytorch#191, tried downgrading torchvision to 0.13 and deleting those 2 lines, and now I'm getting

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 craft = Craft(output_dir=output_dir, crop_type="poly", cuda=True)
      3 # apply craft text detection and export detected regions to output directory
----> 4 prediction_result = craft.detect_text(image_path)
      6 #unload models from ram/gpu
      7 craft.unload_craftnet_model()

File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/__init__.py:131, in Craft.detect_text(self, image, image_path)
    128     image = image_path
    130 # perform prediction
--> 131 prediction_result = get_prediction(
    132     image=image,
    133     craft_net=self.craft_net,
    134     refine_net=self.refine_net,
    135     text_threshold=self.text_threshold,
    136     link_threshold=self.link_threshold,
    137     low_text=self.low_text,
    138     cuda=self.cuda,
    139     long_size=self.long_size,
    140 )
    142 # arange regions
    143 if self.crop_type == "box":
...
--> 415         polys = np.array(polys)
    416         for k in range(len(polys)):
    417             if polys[k] is not None:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.
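For what it's worth, that traceback looks like a NumPy behavior change rather than a CRAFT bug: since NumPy 1.24, building an array from a ragged list (polygons with differing point counts, as CRAFT can produce) raises this exact "inhomogeneous shape" ValueError instead of silently creating an object array. A minimal reproduction and the usual workaround, passing `dtype=object` explicitly (e.g. by patching that `np.array(polys)` line in `predict.py`):

```python
import numpy as np

# Ragged list: polygons with different numbers of points, like CRAFT's output
polys = [np.zeros((4, 2)), np.zeros((6, 2))]

# On NumPy >= 1.24 this raises:
# ValueError: setting an array element with a sequence. The requested array
# has an inhomogeneous shape after 1 dimensions. ...
try:
    np.array(polys)
except ValueError:
    pass

# Workaround: request an object array explicitly, keeping the ragged polygons
ragged = np.array(polys, dtype=object)
```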

That was with the basic usage example from that repo; with the advanced one:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 24
     21 craft_net = load_craftnet_model(cuda=True)
     23 # perform prediction
---> 24 prediction_result = get_prediction(
     25     image=image,
     26     craft_net=craft_net,
     27     refine_net=refine_net,
     28     text_threshold=0.7,
     29     link_threshold=0.4,
     30     low_text=0.4,
     31     cuda=True,
     32     long_size=1280
     33 )
     35 # export detected text regions
     36 exported_file_paths = export_detected_regions(
     37     image=image,
     38     regions=prediction_result["boxes"],
     39     output_dir=output_dir,
     40     rectify=True
     41 )

File /mnt/d/chatgpt/ocr/craft-text-detector/craft_text_detector/predict.py:91, in get_prediction(image, craft_net, refine_net, text_threshold, link_threshold, low_text, cuda, long_size, poly)
     89 # coordinate adjustment
...
--> 415         polys = np.array(polys)
    416         for k in range(len(polys)):
    417             if polys[k] is not None:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (31,) + inhomogeneous part.

@nyck33

nyck33 commented Jul 16, 2023

I'll note that I tried out a bunch of detectors, and KerasOCR was so far the best at drawing bounding boxes around handwritten text images. I also tried Donut on Hugging Face, but the results were disappointing.

@bit-scientist

Hi @nyck33, I am working on exactly the same project as you did. Could you share your recent insights as to which handwritten text detector worked best for your images? I'd appreciate your help. Thank you!

@nyck33

nyck33 commented Aug 30, 2023 via email

@bit-scientist

bit-scientist commented Aug 30, 2023

Oh, I see, thanks @nyck33. Are you using Cloud Vision for text detection only or for both detection and recognition? How is it doing in terms of CER?

@anandhuh1234

I've trained a YOLOv5 model specifically for detecting both handwritten and printed texts. After that, I extract and forward the identified handwritten lines from the image to TrOCR for processing.
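A small sketch of the glue step that approach needs: sorting the detector's line boxes into reading order and cropping them for TrOCR. The `(x1, y1, x2, y2)` box format, the synthetic page, and the TrOCR call in the comments are all assumptions, not anandhuh1234's actual code:

```python
from PIL import Image

def sort_boxes_reading_order(boxes):
    # (x1, y1, x2, y2) boxes: top-to-bottom, then left-to-right
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def crop_lines(image, boxes):
    # each crop becomes one single-line input for TrOCR
    return [image.crop(tuple(b)) for b in sort_boxes_reading_order(boxes)]

page = Image.new("RGB", (200, 100), "white")     # stand-in for a homework scan
boxes = [(10, 60, 150, 90), (10, 10, 190, 40)]   # detector output, unordered
lines = crop_lines(page, boxes)
# each `line` would then go through TrOCR, roughly:
# pixel_values = processor(line, return_tensors="pt").pixel_values
# text = processor.batch_decode(model.generate(pixel_values),
#                               skip_special_tokens=True)[0]
```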

@myhub

myhub commented Mar 22, 2024

I think that with some extra work TrOCR can also be used for multiline text images. Based on my experiments in crnn_for_text_with_multiple_lines, to make TrOCR suitable for multiline text images, one needs to:

  • regenerate or label training samples with multiline text
  • retrain the model with a larger input image size (e.g. 512*512px)

Multiline text also means you need far more training samples than single-line text does. The input image and output sequence will also be larger, which means you need much more GPU capacity to do the work.

In some situations text line detection is hard, e.g. for curved text, so I think it is worthwhile to train a multiline version of TrOCR that reduces the need for text line detection.
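A back-of-envelope check on that cost claim, assuming TrOCR's encoder keeps its 16 px ViT patches and the 384*384 input side of the released checkpoints (both assumptions about the default setup):

```python
# Rough cost estimate for growing TrOCR's input from 384x384 to 512x512
PATCH = 16  # ViT patch side assumed for TrOCR's encoder

def num_patches(side, patch=PATCH):
    # a side x side image becomes (side // patch) ** 2 encoder tokens
    return (side // patch) ** 2

single_line = num_patches(384)   # tokens at the default input size
multiline = num_patches(512)     # tokens at the proposed multiline size

# encoder self-attention scales roughly with tokens squared
attn_cost_ratio = (multiline / single_line) ** 2
```

With these numbers the token count nearly doubles (576 to 1024) and the self-attention cost roughly triples, before even counting the longer output sequences, which supports the "much more GPU" point above.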
