Training Suggestions for Cyrillic + English #82

Open · 3 comments

@eschaffn opened this issue Apr 13, 2023
Would it be possible to get training recommendations w.r.t. data and parameters?

I'm trying to retrain parseq with a new character set consisting of both Latin (English alphabet) and Cyrillic (Russian alphabet) characters.

I have about 3500 custom image samples that I created by running detection and then cropping out the text. Example:
[image: example cropped text sample]

I have a few questions:

1. Is this a suitable training image?
2. If I have ~3500 images like these, both in English and Russian, how much synthetic data should I augment them with? Or do I need more real data too?
3. My charset is ~160 characters; what should I set the embedding dimension to? Is 384 large enough?

Thank you, and I appreciate any suggestions you can give!

@baudm (Owner) commented Apr 14, 2023

> Is this a suitable training image?

Yes, this works. Just be careful about data augmentation. You might want to reduce the magnitudes first.
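
For illustration, here is a minimal sketch of what "reducing the magnitudes" could look like, using torchvision's RandAugment. This is not parseq's actual augmentation pipeline (the repo has its own augmentation code); the transform choice and magnitude values are assumptions, shown only to make the idea concrete:

```python
# Illustrative only: a gentler augmentation pipeline for a small real dataset.
# torchvision's RandAugment defaults to magnitude=9; lowering it keeps the
# augmented crops closer to the original images.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=3),  # reduced from the default 9
    transforms.Resize((32, 128)),                    # parseq-style input size (H x W)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
```

With only ~3500 real samples, heavy geometric or color distortion can push the training distribution away from your test crops, which is why gentler magnitudes tend to be safer here.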

> If I have ~3500 images like these, both in English and Russian, how much synthetic data should I augment them with? Or do I need more real data too?

Real data is much better. Or rather, the closer the training data distribution is to the test data distribution, the better. Try using the pretrained weights, at least for the encoder.

> My charset is ~160 characters; what should I set the embedding dimension to? Is 384 large enough?

The depth of the encoder has a much bigger effect on model performance compared to the embedding dimension. But if you can use a larger number, use it. Larger models are easier to work with in the experimentation phase.
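
To make the two knobs concrete: parseq's encoder builds on timm's VisionTransformer, so depth and embedding dimension show up roughly as below. This is a hedged sketch with ViT-Small-like values (and an assumed patch size), not a recommendation; it needs a recent timm release:

```python
# Sketch: where depth and embed_dim live in a ViT-style encoder (via timm).
# depth (number of transformer blocks) tends to matter more than embed_dim.
from timm.models.vision_transformer import VisionTransformer

encoder = VisionTransformer(
    img_size=(32, 128),   # height x width of the input crops
    patch_size=(4, 8),    # assumption: small patches for low-resolution text
    in_chans=3,
    embed_dim=384,        # the dimension asked about (384 = ViT-Small width)
    depth=12,             # encoder depth: the bigger lever per the comment above
    num_heads=6,
    num_classes=0,        # no classification head; tokens feed a decoder
    global_pool='',       # keep the full token sequence
    class_token=False,    # assumption: no [CLS] token needed for recognition
)
```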

@eschaffn (Author) commented Apr 14, 2023

> Real data is much better. Or rather, the closer the training data distribution is to the test data distribution, the better. Try using the pretrained weights, at least for the encoder.

1. How do I use just the pretrained encoder weights?
2. I was thinking of using ~10M synthetic images generated with synthtiger (https://github.com/clovaai/synthtiger). Does that seem sufficient?

> The depth of the encoder has a much bigger effect on model performance compared to the embedding dimension. But if you can use a larger number, use it. Larger models are easier to work with in the experimentation phase.

What do you suggest setting the depth and embedding dimension to?

@baudm (Owner) commented Apr 28, 2023

> How do I use just the pretrained encoder weights?

Take a look at the examples for fine-tuning with PyTorch. In a nutshell, you load the model and discard the layers you want to replace (in this case, the decoder).
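
For example, a minimal sketch using the torch.hub entry point from the parseq README. The `'encoder.'` key prefix is an assumption about the checkpoint layout, so inspect `state_dict().keys()` on your checkout before relying on it:

```python
import torch

# Load the pretrained model via the torch.hub entry point from the README.
pretrained = torch.hub.load('baudm/parseq', 'parseq', pretrained=True)

# Keep only the encoder parameters, selected by key prefix.
# (The 'encoder.' prefix is an assumption; verify against your state_dict keys.)
prefix = 'encoder.'
enc_state = {
    k[len(prefix):]: v
    for k, v in pretrained.state_dict().items()
    if k.startswith(prefix)
}

# Then, on your new model configured with the ~160-char charset, load just
# these weights; the decoder and output head remain randomly initialized:
# my_model.encoder.load_state_dict(enc_state)
```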
