GROBID Inconsistent Reference Detection in Custom PDFs: Format Guidelines Needed #1154

JosVuHuynh · 2024-08-12T09:50:52Z

What is the correct format for a PDF file that GROBID can detect references in? I create PDFs myself, and sometimes they work and sometimes they don’t. I’m not sure about the formatting rules. Can you please let me know?

lfoppiano · 2024-08-12T14:55:57Z

By "detect references", do you mean detecting reference callouts (e.g. "In previous work [1] we showed that...") or the references section of the article?

In the first case, GROBID generally has little training data (Fulltext model), but it would be easier if you showed me some examples of your generated documents.

JosVuHuynh · 2024-08-13T01:29:37Z

GwptVMUJQT.pdf
T5D17Q7WMj.pdf
besG09DFZb.pdf
CsoUOcdybT.pdf
Could you review all the files, @lfoppiano? GROBID does not detect the references when I run them on https://huggingface.co/spaces/kermitt2/grobid.
Related issue: #1152

I would like to know the formatting rules I need to follow when creating a new article PDF so that GROBID can accurately detect citations.

lfoppiano · 2024-08-13T03:34:21Z

There are no "rules" for formatting a document so that GROBID recognises the references. It is more a matter of making the document look like a scientific article.
At first glance, the format of these documents is fairly far from the layout of a scientific article. For example, there is no header (at least a title and authors), and the page layout is horizontal (landscape).

Then, most importantly, the references don't match the text, so it is normal that GROBID does not extract them correctly.

I adjusted your document, and with some more consistency the result looks much better ;-) That said, the body does indeed look like an abstract:
Untitled.pdf
Untitled.pdf.tei.xml.zip
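
For readers who want to check a TEI result like the one attached above, here is a minimal sketch (not part of the original thread) that counts the two things distinguished earlier: reference callouts in the body and entries in the references section. It assumes the usual GROBID TEI conventions, where callouts appear as `<ref type="bibr" target="#b0">` and bibliography entries as `<biblStruct xml:id="b0">` inside `<listBibl>`. The function name `summarize_references` and the `SAMPLE_TEI` string are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in GROBID output, and the expanded name of xml:id.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize_references(tei_xml: str) -> dict:
    """Count bibliography entries and in-text callouts in a TEI string."""
    root = ET.fromstring(tei_xml)

    # Bibliography entries: <biblStruct xml:id="b0"> inside <listBibl>.
    bib_ids = {
        bibl.get(XML_ID)
        for bibl in root.findall(".//tei:listBibl/tei:biblStruct", TEI_NS)
        if bibl.get(XML_ID)
    }

    # In-text callouts: <ref type="bibr" target="#b0">[1]</ref>.
    # A callout without a target was detected but not linked to an entry.
    callouts = root.findall(".//tei:ref[@type='bibr']", TEI_NS)
    targets = [ref.get("target", "").lstrip("#") for ref in callouts]
    linked = [t for t in targets if t in bib_ids]

    return {
        "bibliography_entries": len(bib_ids),
        "callouts": len(callouts),
        "linked_callouts": len(linked),
        "unlinked_callouts": len(callouts) - len(linked),
        "uncited_entries": sorted(bib_ids - set(linked)),
    }


# Hypothetical, hand-written TEI fragment (not output from the attached files).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>In previous work <ref type="bibr" target="#b0">[1]</ref> we showed
      that this holds, unlike <ref type="bibr">[3]</ref>.</p>
    </body>
    <back>
      <listBibl>
        <biblStruct xml:id="b0"/>
        <biblStruct xml:id="b1"/>
      </listBibl>
    </back>
  </text>
</TEI>"""

if __name__ == "__main__":
    print(summarize_references(SAMPLE_TEI))
```

Many unlinked callouts or uncited entries in a real result would be consistent with the situation described above, where the reference list and the body text do not match.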

lfoppiano added the bug and question labels and removed the bug label Aug 13, 2024

lfoppiano closed this as completed Oct 26, 2024
