[models] Add VIPTR recognition model #1867

felixdittrich92 · 2025-02-05T12:32:55Z

🚀 The feature

Paper: VIPTR
Implementation: https://github.com/cxfyxl/VIPTR

PyTorch implementation
TensorFlow implementation

Hi, 

I would like to suggest introducing another state-of-the-art text recognition architecture to docTR:
[SVIPTR](https://paperswithcode.com/paper/viptr-a-vision-permutable-extractor-for-fast)
It promises accurate results at low latency.

Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, the SVIPTR-L (Large) attains SOTA accuracy in single-encoder-type models, while maintaining a low parameter count and favorable inference speed.

Thanks for your consideration.

Inference latency should be comparable to crnn_mobilenet_v3_large, and accuracy is hopefully comparable to parseq.
The addition has been agreed on.


felixdittrich92 · 2025-02-05T12:35:43Z

If someone wants to work on this, feel free to ping here. Otherwise, I plan to start working on it once we have a strategy in place for making docTR multilingual.

lkosh · 2025-02-28T12:34:48Z

Hi, if the issue is still open, I'd like to work on the PyTorch implementation of this feature. I've been using docTR quite extensively lately and I'd like to help improve this wonderful project :)

felixdittrich92 · 2025-02-28T12:59:23Z

> Hi, if the issue is still open, I'd like to work on the PyTorch implementation of this feature. I've been using docTR quite extensively lately and I'd like to help improve this wonderful project :)

Hey @lkosh 👋

Sure, highly appreciated 👍

I already started on the PT implementation; maybe this would be a good starting point for you to continue from:

main...felixdittrich92:doctr:viptr-torch

Some points which are missing:

1. Clean up layers
1. I had in mind to refactor the VIPBlock so that the final VIPNet can inherit from nn.Sequential, like all the other classification models we have, which avoids the mixer_types condition
1. Implement the recognition part: VIPNet as feature extractor from the classification module + custom (linear) head + CTC loss + postprocessor + base class _VIPTR for target building
1. Test everything (dummy run; I can provide a dataset for testing or test it on my machine)
1. Add missing unit tests + documentation entries
1. Port code to TensorFlow (optional, second PR)
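
For the postprocessor mentioned in the recognition point, here is a minimal sketch of CTC best-path (greedy) decoding in plain Python. It is only an illustration of the idea, not the code from the branch or docTR's API: `ctc_best_path` and `one_hot` are hypothetical names, and placing the blank class at the last index is an assumption. A real postprocessor would work on batched tensors and would probably also return confidences.

```python
from typing import List, Sequence


def ctc_best_path(logits: Sequence[Sequence[float]], vocab: str, blank: int) -> str:
    """Greedy (best-path) CTC decoding for a single sequence.

    logits: T time steps, each holding len(vocab) + 1 class scores.
    blank: index of the CTC blank class (assumed to be len(vocab) here).
    """
    # Step 1: pick the highest-scoring class at every time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]

    # Step 2: collapse consecutive repeats and drop blanks.
    chars: List[str] = []
    previous = None
    for idx in best:
        if idx != previous and idx != blank:
            chars.append(vocab[idx])
        previous = idx
    return "".join(chars)


def one_hot(index: int, size: int) -> List[float]:
    """Hypothetical helper: fake per-step scores with a single winning class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = "abc"
blank = len(vocab)  # assumption: blank is the last class
# Frame-wise winners: a, a, blank, a, b, b, blank
frames = [0, 0, blank, 0, 1, 1, blank]
logits = [one_hot(i, len(vocab) + 1) for i in frames]
print(ctc_best_path(logits, vocab, blank))  # aab
```

The blank between the second and third frame is what lets the repeated "a" survive the collapse step; without it, "a, a, a" would decode to a single "a".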

The ViTSTR PR may be a useful reference, as it shows all the required parts:

https://github.com/mindee/doctr/pull/1055/files

felixdittrich92 added this to the 1.0.0 milestone Feb 5, 2025

felixdittrich92 self-assigned this Feb 5, 2025
