Conflict of interests missing from xml output #1142

Open
mariadelmarq opened this issue Jul 15, 2024 · 5 comments
Labels
error cases Some error/test case for future improvements

Comments

@mariadelmarq

Hi,

We are considering Grobid for a project that examines conflict of interest, funding, and other transparency statements in published articles. Publishers place these statements in different locations: sometimes in footnotes, sometimes after the abstract, sometimes in the back matter, etc.

For the published PDF of this particular article (not the author manuscript, which is open access, but the actual PDF published by the APA): https://pubmed.ncbi.nlm.nih.gov/27819460/, Grobid correctly extracts the funding information from paragraph 4 of the footnote on page 1, but the conflict of interest statement, contained in paragraph 5 of the same footnote, is missing from the XML output. I suspect Grobid does not know where to put it in the XML... Is there any chance this has an easy fix?

@lfoppiano added the "error cases" label (Some error/test case for future improvements) on Jul 15, 2024
@lfoppiano
Collaborator

Hi @mariadelmarq, thanks for reporting this problem.

Could you please send me the PDFs for this issue and for #1143 at luca AT sciencialab.com?

I'm not able to access them via the pubmed / publisher portal 😅

@mariadelmarq
Author

Sent, thanks heaps for looking into it!

@lfoppiano
Collaborator

Thanks for sending the files. I'm sorry I did not have time to check them until now.

For the file discussed in this issue, there are three issues:

  1. The header model truncated the funding information, and the missing part (near Lee M. Ritterband, tagged as <other>) is somehow lost. I'm not sure this is a bug, because the funding information itself is correctly covered. As far as I understand, the conflict of interest statement should not be part of the funding statement in the Grobid approach, or at least not in this version of the funding-acknowledgment extraction. I leave this to @kermitt2 for confirmation.
  2. This issue points out an interesting aspect: there is indeed a need to keep the text that is not classified in the header, which is currently lost. We might want to collect it somewhere in the XML output.
  3. There is another issue with the segmentation, as the first paragraph is also missing from the output XML. All of Grobid's training data is limited to CC-BY documents, so it's possible that this kind of layout has not received particular attention or training data. Nevertheless, it is possible to create private training data to train Grobid to support this kind of document.

@kermitt2
Owner

Hello !

Indeed, the Conflict of Interest section is not part of the funding section and is considered a section of its own. However, Grobid does not yet identify it explicitly as such. This is something to do in the future: extend the segmentation and header models to explicitly recognize COI sections, which I don't think is complicated. I have already received this request; COI statements are more and more common.

About the text lost in the header: what is labeled <other> is normally "noise" that we don't want to add to the output (even under a note element). Unfortunately, that does not work in this example, but if we extend the model(s) to cover COI, we can expect a good fix.

@mariadelmarq
Author

Thank you both so much for looking into this. For the other articles I'm looking at, Conflict of Interest statements tend to end up in the back matter tag, either one or two divs down, or sometimes within a note tag. Sometimes they do end up in the body, though, which is ok for me, as long as they're somewhere.
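
Since the statements can land in several places (a div one or two levels down in the back matter, a note, or the body), a keyword scan over the TEI output is one way to locate them regardless of position. The following is only a minimal sketch under stated assumptions: the function name `find_coi_statements`, the keyword pattern, and the sample document are all hypothetical and not part of Grobid. It also only inspects innermost divs and notes that sit outside any div, so text placed directly in an outer div would be missed.

```python
import re
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # TEI namespace used by Grobid's XML output

# Hypothetical keyword list; extend it for the wording your publishers use.
COI_PATTERN = re.compile(
    r"conflicts?\s+of\s+interests?|competing\s+interests?|declarations?\s+of\s+interests?",
    re.IGNORECASE,
)


def _text(el: ET.Element) -> str:
    """Join all text under an element and normalize whitespace."""
    return " ".join(" ".join(el.itertext()).split())


def find_coi_statements(tei_xml: str) -> list[tuple[str, str]]:
    """Return (region, text) pairs for divs and notes that mention a COI keyword."""
    root = ET.fromstring(tei_xml)
    hits = []
    for region in ("teiHeader", "body", "back"):
        for container in root.iter(f"{{{TEI}}}{region}"):
            divs = list(container.iter(f"{{{TEI}}}div"))
            # Elements already covered by some div, so nested notes are not reported twice.
            inside_div = {id(e) for d in divs for e in d.iter()}
            # Innermost divs only, to avoid reporting an outer div and its child.
            candidates = [d for d in divs if d.find(f"{{{TEI}}}div") is None]
            candidates += [
                n for n in container.iter(f"{{{TEI}}}note") if id(n) not in inside_div
            ]
            for el in candidates:
                text = _text(el)
                if COI_PATTERN.search(text):
                    hits.append((region, text))
    return hits


# Made-up TEI fragment for illustration only (not output from the article above).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head><p>Insomnia is common.</p></div>
    </body>
    <back>
      <div type="annex">
        <div><head>Conflict of Interest</head><p>The authors declare none.</p></div>
      </div>
      <note>Competing interests: one author holds equity.</note>
    </back>
  </text>
</TEI>"""

for region, text in find_coi_statements(SAMPLE):
    print(region, "|", text)
```

A scan like this cannot recover a statement that Grobid dropped entirely, as in the footnote case reported here; it only helps when the text is present somewhere in the XML.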

@lfoppiano changed the title from "Section of pdf file is missing from xml output, even though adjacent paragraphs are included" to "Conflict of interests missing from xml output" on Jul 25, 2024