Extracting table with vertical texts give unreadable result #942

Dragon2fly · 2023-07-22T02:35:46Z

Describe the bug

Table extraction with vertical header text returns unreadable strings or text in reversed order.

Have you tried repairing the PDF?

Yes. The problem is still there

Code to reproduce the problem

import pdfplumber

pdf = pdfplumber.open(r"tests\pdf_samples\camelot\agstat.pdf", repair=True)
p0 = pdf.pages[0]
# im = p0.to_image()
# im.debug_tablefinder()
# im.show()
table = p0.extract_table()
for line in table:
    print(line)

PDF file

agstat.pdf

Expected behavior

The vertical text in the red box should be extracted correctly.

Actual behavior

It returned unreadable text for the first row:

['Sl.\nNo.', 'District', 'noitalupoP\n31-2102\n)shkal\ndetcejorP\nnI(\nrof', '%88\not )shkal\ntludA\ntnelaviuqE\nnI(', ')yad/tluda/smg004\nnoitpmusnoC\n)sennot\ntnemeriuqer\nhkaL\nlatoT\nnI(\n@(', 'tnemeriuqeR ,sdees )egatsaw )sennot\ngnidulcnI(\nhkaL\n&\nsdeef\nlatoT\nnI(', 'Production (Rice)\n(In Lakh tonnes)', None, None, 'Surplus/Defi cit\n(In Lakh\ntonnes)', None]

And it returned reversed text for the second row:

[None, None, None, None, None, None, 'firahK', 'ibaR', 'latoT', 'eciR', 'yddaP']

Screenshots

The table outline is still detected correctly

Environment

pdfplumber version: 0.10.1
Python version: [e.g., 3.10]
OS: Windows 10


cmdlineluser · 2023-07-22T08:50:08Z

You can try modifying the default text extraction options e.g.

page.extract_table(dict(text_vertical_ttb=False))

[['Sl.\nNo.',
  'District',
  'Population\n2012-13\nlakhs)\nProjected\n(In\nfor',
  '88%\nto lakhs)\nAdult\nEquivalent\n(In',
  '400gms/adult/day)\nConsumption\ntonnes)\nrequirement\nLakh\nTotal\n(In\n(@',
  'Requirement seeds, wastage) tonnes)\n(Including\nLakh\n&\nfeeds\nTotal\n(In',
  'Production (Rice)\n(In Lakh tonnes)',
  None,
  None,
  'Surplus/Defi cit\n(In Lakh\ntonnes)',
  None],
 [None,
  None,
  None,
  None,
  None,
  None,
  'Kharif',
  'Rabi',
  'Total',
  'Rice',
  'Paddy']]
...

Dragon2fly · 2023-07-22T20:17:51Z

Hi @cmdlineluser

Thank you for your suggestion. It worked!
But I don't see the param text_vertical_ttb mentioned anywhere in the README.md.
Are you planning to turn it on/off this feature automatically?

cmdlineluser · 2023-07-22T20:50:30Z

They are mentioned in the description of the .extract_words() method.

The parameters horizontal_ltr and vertical_ttb indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words).
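To make the reading-direction idea concrete, here is a toy sketch in plain Python. It is not pdfplumber's implementation; `read_vertical_word` is a hypothetical helper, and the character dicts only borrow the `text` and `top` key names that pdfplumber uses for characters. It simply shows why a word printed bottom-to-top comes out reversed when it is read top-to-bottom.

```python
def read_vertical_word(chars, vertical_ttb=True):
    """Join the characters of one vertical word in the requested direction.

    `chars` is a list of dicts with "text" and "top" keys, where "top" is the
    distance from the top of the page (a smaller value means higher up).
    This is an illustrative helper, not part of pdfplumber.
    """
    # Top-to-bottom means ascending "top"; bottom-to-top means descending.
    ordered = sorted(chars, key=lambda c: c["top"], reverse=not vertical_ttb)
    return "".join(c["text"] for c in ordered)


# A word printed bottom-to-top: "R" sits lowest on the page, "e" highest.
chars = [
    {"text": "R", "top": 130},
    {"text": "i", "top": 120},
    {"text": "c", "top": 110},
    {"text": "e", "top": 100},
]

print(read_vertical_word(chars, vertical_ttb=True))   # eciR
print(read_vertical_word(chars, vertical_ttb=False))  # Rice
```

Read top-to-bottom, the word reproduces the reversed 'eciR' from the report; read bottom-to-top, it comes out as 'Rice'.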

With regards to plans, I'm just a fellow pdfplumber user.

That would probably be a question for @jsvine

jsvine · 2023-07-23T22:26:26Z

Thanks for your help here, @cmdlineluser!

@Dragon2fly, it's helpful to hear your confusion. To know about text_vertical_ttb, you would have had to jump between a few different parts of the README.md file. I'll aim to add better documentation of the text-related methods soon.

> Are you planning to turn it on/off this feature automatically?

I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

Dragon2fly · 2023-07-24T14:31:06Z

Hi @jsvine,

> I don't plan on making any major changes to this parameter or its availability. Does that answer your question, or have I misunderstood it?

From a user experience perspective, the fewer parameters that need to be configured the better. So I just wonder if there is a way to detect the text orientation and just extract it correctly.

Anyway, although text_vertical_ttb did reverse the text correctly, for multi-line vertical text the output still mixes up text from different lines: Population\n2012-13\nlakhs)\nProjected\n(In\nfor

The correct output should be Projected Population\nfor 2012-13\n(In lakhs).
I tried use_text_flow=True, but it didn't help either.

Any suggestion?

jsvine · 2023-07-25T15:24:02Z

@Dragon2fly Thank you for clarifying. At the moment, adding automatic text-direction detection isn't on my roadmap, due to the likely large number of edge-cases, and my preference to keep extraction "predictable" and parameters explicit. But I appreciate the suggestion and will keep your use-case in mind.
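For readers who still want to guess the direction themselves, a rough user-side heuristic might look like the sketch below. It is not something pdfplumber provides: `guess_vertical_ttb` is a hypothetical helper, and it assumes each character dict carries an `upright` flag and a six-number `matrix`, that the rotation is a plain 90 degrees, and that the page itself is neither rotated nor mirrored. These are exactly the kinds of edge cases mentioned above.

```python
from collections import Counter


def guess_vertical_ttb(chars):
    """Guess whether rotated text reads top-to-bottom.

    Returns True or False, or None when no rotated characters are found.
    Hypothetical helper; assumes "upright" and "matrix" keys on each char.
    """
    votes = Counter()
    for char in chars:
        if char.get("upright", True):
            continue  # horizontal text says nothing about vertical direction
        # matrix = (a, b, c, d, e, f). For a plain 90-degree rotation a and d
        # are 0, and the sign of b tells which way glyphs advance in PDF
        # space, where y grows upward.
        b = char["matrix"][1]
        if b > 0:
            votes["bottom_to_top"] += 1
        elif b < 0:
            votes["top_to_bottom"] += 1
    if not votes:
        return None
    return votes["top_to_bottom"] >= votes["bottom_to_top"]


# Made-up characters: one horizontal, two rotated 90 degrees counter-clockwise.
sample = [
    {"text": "A", "upright": True, "matrix": (1, 0, 0, 1, 0, 0)},
    {"text": "R", "upright": False, "matrix": (0, 1, -1, 0, 50, 100)},
    {"text": "i", "upright": False, "matrix": (0, 1, -1, 0, 50, 110)},
]
print(guess_vertical_ttb(sample))  # False, i.e. bottom-to-top
```

If the guess is not None, it could be passed along as the text_vertical_ttb value in the table settings. A majority vote over the whole page is a deliberate simplification; a page that mixes both directions would need the vote taken per cell.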

Re. lines merging: Try decreasing the text_y_tolerance setting to 0 (or even a negative number). Does that help?
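As a rough picture of what a tolerance setting does, the toy sketch below groups vertical positions into lines: a value joins the current group when it lies within the tolerance of the previous value. `cluster_by_tolerance` is a hypothetical helper written for illustration, the positions are made up, and the tolerance values are arbitrary examples; pdfplumber's actual grouping may differ in detail.

```python
def cluster_by_tolerance(values, tolerance):
    """Group numbers; start a new group when the gap exceeds `tolerance`.

    Illustrative only, not pdfplumber's code.
    """
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)  # close enough: same line
        else:
            groups.append([value])    # too far away: new line
    return groups


# Made-up "top" positions of five words.
tops = [100.0, 100.4, 103.0, 103.2, 109.0]

print(len(cluster_by_tolerance(tops, tolerance=3)))  # 2 groups
print(len(cluster_by_tolerance(tops, tolerance=0)))  # 5 groups
```

With the larger tolerance, two visually distinct lines are merged into one group; with a tolerance of 0, every distinct position becomes its own line.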

Dragon2fly · 2023-07-30T13:24:46Z

Hi @jsvine. Thanks for your suggestion.
But setting text_y_tolerance to 0 or -1 didn't help.
There should be other ways to solve this problem.

jsvine · 2023-08-03T14:07:22Z

Thank you @Dragon2fly. Looking into this, there may be a bug in how pdfplumber handles bottom-to-top text. I will investigate and hope to find a fix.
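In the meantime, one possible user-side workaround is to rebuild the text of a vertical header cell directly from its characters. The sketch below is not pdfplumber's algorithm and not a confirmed fix. It assumes the text is rotated 90 degrees counter-clockwise, so that each line of text is a vertical strip, lines run left to right, and characters within a line run bottom to top. `rebuild_bottom_to_top_text` and `make_line` are hypothetical helpers, the default thresholds are guesses, and the sample coordinates are made up.

```python
def rebuild_bottom_to_top_text(chars, line_tolerance=3, word_gap=2):
    """Rebuild multi-line text that reads bottom-to-top (hypothetical helper).

    `chars` are dicts with "text", "x0", "top" and "bottom" keys.
    """
    # 1. Split into lines: characters with similar x0 share a vertical strip.
    lines = []
    for char in sorted(chars, key=lambda c: c["x0"]):
        if lines and char["x0"] - lines[-1][-1]["x0"] <= line_tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])

    # 2. Read each line from the bottom of the page upward.
    rebuilt = []
    for line in lines:
        line.sort(key=lambda c: c["top"], reverse=True)
        pieces, previous = [], None
        for char in line:
            # A vertical gap larger than `word_gap` is treated as a space.
            if previous is not None and previous["top"] - char["bottom"] > word_gap:
                pieces.append(" ")
            pieces.append(char["text"])
            previous = char
        rebuilt.append("".join(pieces))
    return "\n".join(rebuilt)


def make_line(text, x0, start_bottom=300, size=5, gap=3):
    """Build made-up characters for one bottom-to-top line (test data only)."""
    chars, bottom = [], start_bottom
    for ch in text:
        if ch == " ":
            bottom -= gap  # leave a visible gap instead of a character
            continue
        chars.append({"text": ch, "x0": x0, "top": bottom - size, "bottom": bottom})
        bottom -= size
    return chars


cell_chars = (
    make_line("Projected Population", x0=10)
    + make_line("for 2012-13", x0=22)
    + make_line("(In lakhs)", x0=34)
)
print(rebuild_bottom_to_top_text(cell_chars))
```

On a real page, the input would be the characters of a single header cell, for example those inside a crop of that cell's bounding box. Both thresholds would need tuning per document.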

afriedman412 · 2023-10-27T17:32:28Z

this rhymes with #942

going to work on it

Dragon2fly added the bug label Jul 22, 2023

jsvine closed this as completed Jul 23, 2023

jsvine reopened this Aug 3, 2023

jsvine self-assigned this Aug 3, 2023
