Should I use Paddle.io.DataLoader to prepare data for PaddleOCR #7009
-
Hey, I'm trying to utilize Paddle.io.DataLoader to run PaddleOCR on a large number of images stored on disk. The code looks like this:
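Roughly like this (a simplified sketch, not my exact code; the dataset class, folder path, and batch size are just placeholders):

```python
import os

import cv2
import paddle
from paddle.io import Dataset, DataLoader
from paddleocr import PaddleOCR


class ImageFolderDataset(Dataset):
    """Reads images from a directory on disk (placeholder for my real dataset)."""

    def __init__(self, image_dir):
        super().__init__()
        self.paths = [os.path.join(image_dir, name) for name in sorted(os.listdir(image_dir))]

    def __getitem__(self, idx):
        # cv2.imread returns an HWC, BGR, uint8 numpy array.
        img = cv2.imread(self.paths[idx])
        return paddle.to_tensor(img)  # so the DataLoader yields paddle.Tensor batches

    def __len__(self):
        return len(self.paths)


# Assumes all images share the same shape so they can be stacked into a batch.
loader = DataLoader(ImageFolderDataset("./images"), batch_size=8, num_workers=4)
ocr = PaddleOCR(use_angle_cls=True, lang="en")

for batch in loader:
    for img_tensor in batch:
        # PaddleOCR's ocr() wants a numpy array (or an image path), not a
        # paddle.Tensor, so each image has to be converted back first.
        result = ocr.ocr(img_tensor.numpy(), cls=True)
```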
However, the data returned from the DataLoader is a paddle.Tensor, while PaddleOCR expects a numpy array (or an image path), so each image has to be converted back before calling `ocr()`. Basically, I'm interested in knowing how to achieve the best efficiency during the inference stage. So far, following the "quick_start" guide, we're feeding images to PaddleOCR one by one; is there any "batch inference" method? I also notice there is a "use_mp" option that uses multi-processing to speed up inference, but since we're running inference on the GPU, I guess "use_mp" won't help here.
-
@bdeng3 Did you manage to use both the DataLoader and use_mp to speed up the inference? Thanks in advance.
-
`Paddle.io.DataLoader` doesn't work. The input batch size must be 1 when `det` is used, so if you want to speed up the inference you can use `use_mp`: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/tools/infer/predict_system.py#L207. If you only use `rec`, you can set `rec_batch_num` to a larger number here: https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/tools/infer/utility.py#L86