Abstract for paper is not correctly extracted from PDF #1155

Open
landryraccoon opened this issue Aug 13, 2024 · 1 comment
Labels
error cases Some error/test case for future improvements models:segmentation

Comments

@landryraccoon

I ran Grobid 0.8.0 in Docker and performed full-text extraction on the following PDF:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10125888/pdf/10.1177_23328584231165919.pdf

Below is the XML fragment Grobid returned for the abstract and the text that follows it:

<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>interpret algebraic expressions. This current study, therefore, tested the impacts of three educational technology interventions on algebraic understanding among students in Grade 7 across four conditions: (a) From Here to There (FH2T), (b) DragonBox 12+ (DragonBox), (c) Immediate Feedback, and (d) Active Control. The FH2T and DragonBox</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>2 conditions represented use of game-based applications. Immediate Feedback entailed problem sets by using an online homework system, ASSISTments. For the purposes of this study, the Active Control condition mimicked traditional homework assignments while still using technology. Although this study independently investigated each of the three treatment conditions to the Active Control condition, it was hypothesized that FH2T, an interactive game developed based on theories of perceptual learning and embodied cognition, might improve students' algebraic understanding through aligning their attention to, actions on, and perception of algebraic notations with high-level mathematical concepts to a greater extent than would DragonBox or Immediate Feedback.</p></div>

It appears that Grobid returns the continuation of the abstract as part of the body instead of inside the <abstract> XML node, where it is expected.
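
For anyone who wants to check their own output for the same symptom, here is a minimal sketch that pulls the abstract and the first body paragraph out of a TEI document and flags an abstract that looks cut off. It uses only the Python standard library. The function names, the sample document, and the truncation heuristic (starts with a lowercase letter or does not end with sentence punctuation) are assumptions made for illustration and are not part of Grobid.

```python
import xml.etree.ElementTree as ET

# Grobid's TEI output uses this default namespace (visible in the fragment above).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element):
    """Return the element's text content with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def abstract_and_first_body_paragraph(tei_xml):
    """Extract the abstract text and the first <p> of <body> from a TEI string."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:abstract", TEI_NS)
    body = root.find(".//tei:body", TEI_NS)
    first_p = body.find(".//tei:p", TEI_NS) if body is not None else None
    return _text(abstract), _text(first_p)


def looks_truncated(text):
    """Hypothetical heuristic: empty, lowercase start, or no closing punctuation."""
    if not text:
        return True
    return text[0].islower() or text[-1] not in ".!?"


# Shortened stand-in for the output reported in this issue.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <div><p>interpret algebraic expressions. The FH2T and DragonBox</p></div>
  </abstract></profileDesc></teiHeader>
  <text xml:lang="en"><body>
    <div><p>2 conditions represented use of game-based applications.</p></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    abstract, first_paragraph = abstract_and_first_body_paragraph(SAMPLE)
    print("abstract looks truncated:", looks_truncated(abstract))
    print("first body paragraph:", first_paragraph)
```

A check like this only surfaces suspicious documents; it cannot tell whether the missing text ended up in the body or was dropped entirely.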

@lfoppiano lfoppiano added the error cases Some error/test case for future improvements label Aug 14, 2024
@lfoppiano
Collaborator

Hi @landryraccoon, thank you for reporting this issue and the related document.

Indeed, there are two problems here:

  • the whole abstract is missing; instead, part of the first right column of the body is classified as the abstract
  • the left column of the body on the first page is lost

It looks like we can use this document for training in the future.
