Same document, different PDF files, same curl command, predictably different output. #1135
Comments
Hi @haykharut, the PDF format allows any type of information to be embedded, including fonts and images. Images may be embedded as bitmaps or as vector graphics. Now, although PDF documents look good, they often smell bad :-) There are other differences between these two documents; for example, paper_big has some hidden content that is not present in paper_small.
@lfoppiano thanks so much for getting back. If you don't mind, I would like to ask a couple of follow-up questions. Just to make sure I understand: is it correct to say that, in all likelihood, the larger file represents some figures as vectors and others as bitmaps? In that case, how can I extract the coordinates for vectors when bitmaps are missing? Somewhat bewilderingly, the Grobid HF space processes the larger PDF file correctly. For example, the underlined [...] At the same time, the XML file generated by the curl command I mentioned above references no [...]
I have 2 PDF versions of a paper, which look exactly the same when inspected visually. The only difference I can detect is file size (2.2MB vs 900KB) and the fact that my PDF viewer will show a contents bar for the big file but not the small file. I am no PDF expert.
I process both files with the command below.
The XML outputs differ. Specifically, GROBID will correctly output
<graphic coords=... type='bitmap'>
for all figures in the small file, while it outputs the graphic coords for only 1 figure in the large file, even though it still detects the figures correctly. I am attaching the files for reproducibility. I would appreciate it if someone could help me understand why this happens, or at least help me get started with an investigation.
paper_big.pdf
paper_small.pdf
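To make the difference between the two outputs easier to quantify, here is a minimal Python sketch that counts, per TEI file, how many figures carry a `<graphic>` element with `coords` and which `type` values appear. It is only an illustration: the function name `graphic_report` and the inline `SAMPLE` document are made up for this example, and it assumes the output uses the standard TEI namespace and marks tables as `<figure type="table">`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the TEI output uses the standard TEI namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def graphic_report(root):
    """Summarize <figure> elements and the <graphic> coords they carry."""
    figures = 0
    with_coords = 0
    types = Counter()
    for figure in root.iter(TEI_NS + "figure"):
        # Assumption: tables are encoded as <figure type="table">; skip them.
        if figure.get("type") == "table":
            continue
        figures += 1
        graphics = [g for g in figure.iter(TEI_NS + "graphic") if g.get("coords")]
        if graphics:
            with_coords += 1
        for g in graphics:
            types[g.get("type", "unknown")] += 1
    return {"figures": figures, "with_coords": with_coords, "types": types}


# Hypothetical, hand-written sample: one figure with a graphic, one without.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<figure><head>Figure 1</head><graphic coords="1,10,20,30,40" type="bitmap"/></figure>
<figure><head>Figure 2</head></figure>
<figure type="table"><head>Table 1</head></figure>
</body></text></TEI>"""

report = graphic_report(ET.fromstring(SAMPLE))
print(report)
```

For real output, `ET.parse(path).getroot()` can be passed to `graphic_report` for each of the two XML files; a gap between `figures` and `with_coords` in the large file would match the behavior described above.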