Document layout analysis - superscript / subscript #806

Open
jamesanastasi opened this issue Mar 20, 2024 · 1 comment


jamesanastasi commented Mar 20, 2024

I'm having a bit of difficulty with this particular use case:

When a line contains superscript, the line extraction tends to extract the superscript word as a new line. This is bothersome because the word ends up in the wrong place in the raw text.
Example, from the attached PDF:

TestPDF5.pdf

Integer egestas tristique aliquet. Sed consequat massa non vehicula finibus

is interpreted as:
B1: aliquet. Sed consequat massa non vehicula finibus
B2: Integer egestas tristique

So the raw text is:

aliquet. Sed consequat massa non vehicula finibus Integer egestas tristique

I adjusted the DocstrumBoundingBoxes parameters, setting BetweenLineMultiplier to 0.75, and now I get the words in the right order:
B1: Integer egestas tristique
B2: aliquet. Sed consequat massa non vehicula finibus

But this creates a new problem with the two blocks at the end, where each block has two lines:

Sed a felis fringilla,                           Praesent elementum in enim
maximus libero sit amet.                  id sagittis.

After changing the parameters to fix the superscript, they are split up into separate blocks (and therefore lose their order):

B1: Sed a felis fringilla,
B2: Praesent elementum in enim
B3: maximus libero sit amet.
B4: id sagittis.

I've tried different variations of recursive XYCut and played with the ordered blocks, but I can't seem to find the sweet spot where I get both the right blocks and the right order.

Any suggestions or ideas would be appreciated.


davebrokit commented May 30, 2024

It feels like you need to use the reading order detectors:

https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis#reading-order-detectors

You might need to pass the reading order detector a T parameter higher than 5.

If that doesn't work out of the box, look at the code in UnsupervisedReadingOrderDetector.cs and IntervalRelationsHelper.cs. You can use that to create your own reading order detector where you use the relationship between a superscript box and a normal text box to impose a reading order.
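
To make the idea concrete, here is a rough, library-agnostic sketch in Python. It is not PdfPig code and does not reproduce what UnsupervisedReadingOrderDetector does internally. The `Box` type, the `is_left_neighbour` and `fix_order` helpers, the `max_gap` threshold and all coordinates are made up for illustration, and `t` only plays a role similar to a tolerance. The rule it shows: if a box ends just before another box starts, and the two overlap vertically within the tolerance, the left one is read first. Blocks in separate columns are left alone because the horizontal gap between them is too large.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical text block with a bounding box (PDF coordinates, y grows upwards)."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def is_left_neighbour(a: Box, b: Box, t: float, max_gap: float) -> bool:
    """True when `a` ends just before `b` starts and both sit on roughly the same line."""
    # Vertical extents overlap, allowing a slack of t (the raised fragment is offset).
    vertical_overlap = a.bottom <= b.top + t and b.bottom <= a.top + t
    # Horizontal distance from the end of `a` to the start of `b`.
    horizontal_gap = b.left - a.right
    return vertical_overlap and -t <= horizontal_gap <= max_gap


def fix_order(blocks, t=5.0, max_gap=10.0):
    """Move every left neighbour in front of the box it belongs to; keep the rest as is."""
    ordered = list(blocks)
    for b in blocks:
        for a in blocks:
            if a is not b and is_left_neighbour(a, b, t, max_gap):
                i, j = ordered.index(a), ordered.index(b)
                if i > j:
                    ordered.insert(j, ordered.pop(i))
    return ordered


# Made-up coordinates mimicking the situation described in the issue.
blocks = [
    Box("aliquet. Sed consequat massa non vehicula finibus", 150, 700, 450, 712),
    Box("Integer egestas tristique", 50, 704, 146, 714),  # vertically offset fragment
    Box("Sed a felis fringilla, maximus libero sit amet.", 50, 600, 200, 624),
    Box("Praesent elementum in enim id sagittis.", 300, 600, 450, 624),
]

for box in fix_order(blocks):
    print(box.text)
# Integer egestas tristique
# aliquet. Sed consequat massa non vehicula finibus
# Sed a felis fringilla, maximus libero sit amet.
# Praesent elementum in enim id sagittis.
```

In a real detector the same test would run on the bounding boxes of the blocks returned by the page segmenter, so the default Docstrum parameters could probably stay untouched and the two-line blocks at the end would not be split.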

If you get it working it might be worth raising a PR with an update to UnsupervisedReadingOrderDetector :)
