A sentence from columns #95

Open
vanushkin opened this issue Oct 8, 2021 · 4 comments

Comments

@vanushkin

Dear developers,
I'm having the following issue: when processing PDFs whose text is laid out in columns, I get sentences made up of lines combined from several columns. It makes a mess of the text. Is there a solution to this problem, or a hint as to how I can retain the structure of the original text?

@MarcinKosinski

@vanushkin please look at the tabulizer R package, which deals with this.

@aourednik

@MarcinKosinski I would love to try this solution, but tabulizer has been removed from CRAN, and it has a Java JAR dependency whose execution is blocked by default on the computers in my office. There is no chance of getting the sysadmins to unblock it.
When I export a well-formed PDF "as txt" from Adobe Acrobat, the text flow is respected even though there are 2 columns. There must be something in the PDF's inner markup that identifies the text flow. Couldn't pdftools get the text flow from that information?

@jeroen
Member

jeroen commented Jul 3, 2023

Actually, this is not stored in the PDF's inner markup: https://ropensci.org/blog/2018/12/14/pdftools-20
I think tabulizer tries to guess the layout of columns and tables based on whitespace.
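
To illustrate what guessing columns from whitespace can look like, here is a minimal sketch in Python. It is not tabulizer's or pdftools' actual algorithm. It assumes that every word comes with a bounding box (left edge, top edge, width), that the page has at most two columns, and that the columns are separated by one clear vertical gap. The names `Word`, `find_gutter` and `reading_order`, and the `min_gap` threshold, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float      # left edge of the word's bounding box
    y: float      # top edge of the word's bounding box
    width: float  # width of the bounding box


def find_gutter(words, min_gap=15.0):
    """Return the x midpoint of the widest horizontal gap that no word
    covers, or None if no gap is at least min_gap wide."""
    if not words:
        return None
    # Horizontal extent of every word, sorted by left edge.
    spans = sorted((w.x, w.x + w.width) for w in words)
    best_gap, best_mid = 0.0, None
    right = spans[0][1]  # rightmost edge covered so far
    for left, end in spans[1:]:
        gap = left - right
        if gap > best_gap:
            best_gap, best_mid = gap, (left + right) / 2
        right = max(right, end)
    return best_mid if best_gap >= min_gap else None


def reading_order(words, min_gap=15.0):
    """Return word texts column by column (assumed: at most two columns)."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        columns = [words]
    else:
        columns = [
            [w for w in words if w.x < gutter],
            [w for w in words if w.x >= gutter],
        ]
    ordered = []
    for column in columns:
        # Inside a column: top to bottom, then left to right.
        ordered.extend(sorted(column, key=lambda w: (w.y, w.x)))
    return [w.text for w in ordered]


# A made-up two-column page with two lines per column.
page = [
    Word("Left", 50, 100, 30), Word("one", 85, 100, 25),
    Word("Right", 320, 100, 35), Word("one", 360, 100, 25),
    Word("Left", 50, 115, 30), Word("two", 85, 115, 25),
    Word("Right", 320, 115, 35), Word("two", 360, 115, 25),
]
print(" ".join(reading_order(page)))
# Left one Left two Right one Right two
```

Sorting the same words purely by row would give "Left one Right one Left two Right two", which is the kind of mixed-up sentence described at the top of this thread. A heuristic like this breaks down as soon as a heading spans both columns or the gap between columns is narrower than `min_gap`, so it should be read as an illustration rather than a fix.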

@aourednik

@jeroen I've tried with a PDF file generated by Illustrator (see attached file). Despite the layout's relative complexity, Acrobat recognizes the order of the frames I've defined. This flow order must be stored somewhere, otherwise this would not be possible. Acrobat cannot just guess this on the fly.

Perhaps some inner markup elements specific to Acrobat products?


test-text-flow.pdf
