Implement Vision Grid Transformer for Document Layout Analysis #100
Comments
I'm working on a similar project and am excited to see that you have already started. I'm curious about your progress. If needed, I can offer my help.
That would be great. I have started looking at Advanced Literate Machinery. I was not able to obtain the weights to test the model, but it does look very good.
AlibabaResearch recently published VGT, a new model that sets a new benchmark on the task of Document Layout Analysis (DLA).
Introduction - "To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding."
https://arxiv.org/abs/2308.14978
Effect on LLM usage - VGT can dissect the page into different portions (headers, subheaders, titles, etc.) which can then be OCRed and passed to an LLM for RAG.
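To make the intended pipeline concrete, here is a minimal sketch of the layout-then-OCR flow described above. It assumes a layout model (such as VGT) that returns labeled bounding boxes and an OCR function that takes a box and returns text; `Region`, `regions_to_chunks`, and the `ocr` callable are hypothetical placeholders, not APIs from this project or from the VGT release.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A labeled layout region, as a layout model like VGT might return."""
    label: str              # e.g. "title", "header", "paragraph"
    bbox: tuple             # (x0, y0, x1, y1) in page coordinates


def regions_to_chunks(regions, ocr):
    """OCR each detected region and keep its layout label as metadata,
    so the chunks can be fed to an LLM for RAG with structure preserved."""
    chunks = []
    # Approximate reading order: sort top-to-bottom, then left-to-right.
    for r in sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0])):
        text = ocr(r.bbox)  # `ocr` is a placeholder for any OCR engine call
        if text.strip():
            chunks.append({"label": r.label, "text": text.strip()})
    return chunks
```

The layout labels let the downstream RAG step treat titles and headers differently from body text (for example, using them as chunk metadata for retrieval).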