Extracting table with vertical texts give unreadable result #942

Open
Dragon2fly opened this issue Jul 22, 2023 · 9 comments

Describe the bug

Extracting a table whose header contains vertical text returns unreadable strings, or text in reversed order.

Have you tried repairing the PDF?

Yes. The problem is still there

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)

PDF file

agstat.pdf

Expected behavior

The vertical text in the red box should be extracted correctly.

image

Actual behavior

It returned unreadable text for the first row:

['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]

And it returned reversed text for the second row:

[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']

Screenshots

The table outline is still detected correctly

image

Environment

  • pdfplumber version: 0.10.1
  • Python version: [e.g., 3.10]
  • OS: Windows 10
Dragon2fly added the bug label Jul 22, 2023

cmdlineluser commented Jul 22, 2023

You can try modifying the default text extraction options, e.g.:

page.extract_table(dict(text_vertical_ttb=False))
[['Sl.\nNo.',
  'District',
  'Population\n2012-13\nlakhs)\nProjected\n(In\nfor',
  '88%\nto lakhs)\nAdult\nEquivalent\n(In',
  '400gms/adult/day)\nConsumption\ntonnes)\nrequirement\nLakh\nTotal\n(In\n(@',
  'Requirement seeds, wastage) tonnes)\n(Including\nLakh\n&\nfeeds\nTotal\n(In',
  'Production (Rice)\n(In Lakh tonnes)',
  None,
  None,
  'Surplus/Defi cit\n(In Lakh\ntonnes)',
  None],
 [None,
  None,
  None,
  None,
  None,
  None,
  'Kharif',
  'Rabi',
  'Total',
  'Rice',
  'Paddy']]
...
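
A note on the option name: as far as I understand pdfplumber's table settings, keys that start with text_ are forwarded to the text-extraction step used for each cell, so text_vertical_ttb corresponds to the vertical_ttb option of .extract_words(). The sketch below only illustrates that naming pattern; to_table_settings is a hypothetical helper, not part of pdfplumber.

```python
def to_table_settings(text_options, **core_settings):
    """Prefix text-extraction options with "text_" so they can sit in a table-settings dict.

    Hypothetical helper for illustration; pdfplumber does not ship it.
    """
    settings = dict(core_settings)
    for name, value in text_options.items():
        settings["text_" + name] = value
    return settings


table_settings = to_table_settings({"vertical_ttb": False})
print(table_settings)  # {'text_vertical_ttb': False}

# With pdfplumber installed and a page open, the dict would be passed as:
# table = page.extract_table(table_settings)
```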

@Dragon2fly
Author

Hi @cmdlineluser

Thank you for your suggestion. It worked!
But I don't see the text_vertical_ttb parameter mentioned anywhere in the README.md.
Are you planning to turn this feature on/off automatically?

@cmdlineluser

They are mentioned in the description of the .extract_words() method.

The parameters horizontal_ltr and vertical_ttb indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words).

With regards to plans, I'm just a fellow pdfplumber user.

That would probably be a question for @jsvine
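
To make the reading-direction flags concrete, here is a minimal sketch of why a word rotated to read bottom-to-top comes out reversed when it is read top-to-bottom. It is a simplified model and not pdfplumber's code: read_vertical and the coordinates are made up for illustration, with top meaning the distance from the top of the page.

```python
def read_vertical(chars, ttb=True):
    """Join the characters of one vertical word in reading order.

    `top` is the distance from the top of the page, so ascending `top`
    reads top-to-bottom and descending `top` reads bottom-to-top.
    """
    ordered = sorted(chars, key=lambda c: c["top"], reverse=not ttb)
    return "".join(c["text"] for c in ordered)


# "Rabi" rotated to read bottom-to-top: "R" is lowest on the page (largest top).
# The coordinates are hypothetical.
chars = [
    {"text": "R", "top": 130.0},
    {"text": "a", "top": 120.0},
    {"text": "b", "top": 110.0},
    {"text": "i", "top": 100.0},
]

print(read_vertical(chars, ttb=True))   # ibaR
print(read_vertical(chars, ttb=False))  # Rabi
```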

@jsvine
Owner

jsvine commented Jul 23, 2023

Thanks for your help here, @cmdlineluser!

@Dragon2fly, it's helpful to hear your confusion. To know about text_vertical_ttb, you would have had to jump between a few different parts of the README.md file. I'll aim to add better documentation of the text-related methods soon.

> Are you planning to turn this feature on/off automatically?

I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

jsvine closed this as completed Jul 23, 2023
@Dragon2fly
Author

Hi @jsvine,

> I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

From a user experience perspective, the fewer parameters that need to be configured the better. So I just wonder if there is a way to detect the text orientation and just extract it correctly.

Anyway, although text_vertical_ttb=False fixed the reversed text, the output for multi-line vertical text still mixes up words from different lines: Population\n2012-13\nlakhs)\nProjected\n(In\nfor

The correct one should be Projected Population\nfor 2012-13\n(In lakhs).
I tried use_text_flow=True but it didn't help either.

Any suggestions?

@jsvine
Owner

jsvine commented Jul 25, 2023

@Dragon2fly Thank you for clarifying. At the moment, adding automatic text-direction detection isn't on my roadmap, due to the likely large number of edge-cases, and my preference to keep extraction "predictable" and parameters explicit. But I appreciate the suggestion and will keep your use-case in mind.

Re. lines merging: Try decreasing the text_y_tolerance setting to 0 (or even a negative number). Does that help?
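
For context, a tolerance such as text_y_tolerance decides how far apart two vertical positions can be and still count as the same line. The sketch below is a simplified illustration of tolerance-based grouping, not pdfplumber's actual implementation; cluster and the sample positions are hypothetical.

```python
def cluster(values, tolerance):
    """Group values so that each one is within `tolerance` of the previous value in its group."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


tops = [100.0, 101.5, 102.0, 110.0]  # hypothetical `top` positions of four words
print(cluster(tops, tolerance=3))  # [[100.0, 101.5, 102.0], [110.0]]
print(cluster(tops, tolerance=0))  # [[100.0], [101.5], [102.0], [110.0]]
```

In this simplified model, a smaller tolerance can only split groups further. It cannot change the axis along which lines are formed, which is what matters for rotated text.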

@Dragon2fly
Author

Hi @jsvine. Thanks for your suggestion.
But setting text_y_tolerance to 0 or -1 didn't help.
There should be other ways to solve this problem.

@jsvine
Owner

jsvine commented Aug 3, 2023

Thank you @Dragon2fly. Looking into this, there may be a bug in how pdfplumber handles bottom-to-top text. I will investigate and hope to find a fix.
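
In the meantime, one possible user-side workaround is to regroup the words of a rotated cell yourself. This is a minimal sketch under stated assumptions: the cell's words are already available as dictionaries with text, x0 and top keys (the shape .extract_words() returns), the text is rotated so that it reads bottom-to-top, and its lines run left to right. read_rotated_cell and the coordinates are hypothetical and have not been checked against agstat.pdf.

```python
def read_rotated_cell(words, x_tolerance=1.0):
    """Rebuild multi-line text that is rotated to read bottom-to-top.

    Words whose x0 values lie within x_tolerance of the previous word form one line.
    Lines are ordered left to right; words inside a line are ordered bottom to top.
    """
    lines = []
    for word in sorted(words, key=lambda w: w["x0"]):
        if lines and word["x0"] - lines[-1][-1]["x0"] <= x_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["top"], reverse=True))
        for line in lines
    )


# Hypothetical word positions for the "Projected Population ..." header cell.
words = [
    {"text": "Population", "x0": 100.0, "top": 50.0},
    {"text": "Projected", "x0": 100.0, "top": 110.0},
    {"text": "2012-13", "x0": 112.0, "top": 50.0},
    {"text": "for", "x0": 112.0, "top": 130.0},
    {"text": "lakhs)", "x0": 124.0, "top": 50.0},
    {"text": "(In", "x0": 124.0, "top": 120.0},
]

print(read_rotated_cell(words))
```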

jsvine reopened this Aug 3, 2023
jsvine self-assigned this Aug 3, 2023
@afriedman412
Contributor

this rhymes with #942

going to work on it
