need help with table extraction settings #1071

MalteHuener · 2023-12-22T16:57:27Z

pdfplumber: 0.10.3
python: 3.9.0
OS: Mac

Hi there,

I'm taking my first steps with pdfplumber and need a bit of assistance.

I am trying to extract the table from the following PDF:
1cropped_test-bwa.pdf

The table body is extracted correctly, but the (centered) header is not: the % characters are assigned to the wrong cells.

These are my table settings:
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
}

I also experimented with "snap_tolerance", "snap_x_tolerance", and "snap_y_tolerance", but without success.

Can you help me get the header extracted correctly?

Output for the first (header) row:
[['Bezeichnung',
'Mrz/2016 %',
'Ges.- %\nLeistg.',
'Ges.- %\nKosten',
'Pers.-\nKosten',
'Auf-\nschlag',
'Jan/2016 - %\nMrz/2016',
'Ges.- %\nLeistg.',
'Ges.- %\nKosten',
'Pers.-\nKosten',
'Auf-\nschlag']]

But it should be this:
[['Bezeichnung',
'Mrz/2016',
'%Ges.- \nLeistg.',
'%Ges.- \nKosten',
'%Pers.-\nKosten',
'Auf-\nschlag',
'Jan/2016 - \nMrz/2016',
'%Ges.- \nLeistg.',
'%Ges.- \nKosten',
'%Pers.-\nKosten',
'Auf-\nschlag']]

The % character is not assigned correctly. It seems that pdfplumber derives the table structure from the table body instead of from the first (header) row. Can I adjust or configure this somehow?

MalteHuener · 2023-12-26T07:45:31Z

Author

Playing around with the table settings, I got it working, but I don't really understand why.

With "text_keep_blank_chars" set to True, pdfplumber separates the header correctly, but opens up a column between 'Mrz/2016 ' and '% Ges.- \n Leistg. ' where there is no content. Can you explain this behavior?

table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 5,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 10,
    "join_y_tolerance": 13,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 3,
    "text_keep_blank_chars": True,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 13,
}

Adjusting "text_tolerance": 3, "text_x_tolerance": 3, and "text_y_tolerance": 3 did not lead to the desired result, even though I expected it to.

But changing "min_words_vertical" to 5 extracted the table correctly:

Could someone explain in a bit more detail what happened here, so that I can better understand pdfplumber's table settings?

And one last question:
Why are there lines extending above the detected table? Are they part of the extracted table?

3 replies

jsvine Jan 7, 2024
Maintainer

Hi @MalteHuener, and thanks for your interest in pdfplumber. Unfortunately, text strategy can indeed be a bit finicky. In your case, the change that made the difference (as determined by trial-and-error commenting out the settings you passed, many of which have no effect on this particular extraction) is this: "text_keep_blank_chars": True,

I.e.,:

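import pdfplumber

# The PDF attached to the first post
pdf = pdfplumber.open("1cropped_test-bwa.pdf")
page = pdf.pages[0]
im = page.to_image()
# Overlay the edges and cells found by the table finder for visual debugging
im.reset().debug_tablefinder({
    "vertical_strategy": "text",
    "text_keep_blank_chars": True,
})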

... produces the following, placing the column divisions to the left of the % characters:

To understand why this is the case, it helps to know that the "text" strategy examines the left and right edges of each word (rather than each character) on the page. By default, % is considered a separate word from the text that follows it; but since there's a literal space (see im.reset().draw_rects(page.chars)) between them, text_keep_blank_chars will keep them as one word and extend the word so that it aligns with the rest. (See code and screenshot below.) Together, that helps pdfplumber identify the proper alignment of the columns.

(
    im.reset()
    .draw_rects(page.extract_words(keep_blank_chars=False))
    .draw_rects(page.extract_words(keep_blank_chars=True), stroke="blue", fill=None)
)
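
To make the effect of keep_blank_chars concrete without a PDF at hand, here is a minimal, simplified model of word grouping. It is not pdfplumber's actual implementation: group_chars and the character coordinates are made up for illustration. It only shows why keeping the blank lets "%" and the following text form one word whose left edge sits at the "%".

```python
def group_chars(chars, keep_blank_chars=False, x_tolerance=3):
    """Group left-to-right chars on one line into words (simplified model)."""
    words, current = [], []

    def flush():
        # Close the word being built, recording its left and right edges.
        if current:
            words.append({
                "text": "".join(c["text"] for c in current),
                "x0": current[0]["x0"],
                "x1": current[-1]["x1"],
            })
            current.clear()

    for ch in chars:
        if ch["text"].isspace() and not keep_blank_chars:
            flush()  # a blank ends the current word and is dropped
            continue
        if current and ch["x0"] - current[-1]["x1"] > x_tolerance:
            flush()  # a wide horizontal gap also ends the word
        current.append(ch)
    flush()
    return words


# Hypothetical character boxes for the header text "% Ges"
chars = [
    {"text": "%", "x0": 100, "x1": 106},
    {"text": " ", "x0": 106, "x1": 109},
    {"text": "G", "x0": 109, "x1": 116},
    {"text": "e", "x0": 116, "x1": 121},
    {"text": "s", "x0": 121, "x1": 126},
]

split = group_chars(chars, keep_blank_chars=False)
merged = group_chars(chars, keep_blank_chars=True)
print([w["text"] for w in split])   # ['%', 'Ges']
print([w["text"] for w in merged])  # ['% Ges']
```

In the merged case the single word starts at x0 = 100, the left edge of the "%", which is the kind of edge the "text" strategy can then line up with the words in the rows below.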

And one last question:
Why are there lines extending above the detected table? Are they part of the extracted table?

That is just an artifact of the table-detection "algorithm", not part of the extracted table.
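
As background for the "min_words_vertical" experiments earlier in the thread: the pdfplumber documentation describes this setting as the minimum number of words that must share the same alignment before the "text" strategy treats that alignment as a vertical edge. The sketch below is a rough approximation of that idea, looking at left edges only; aligned_left_edges, its tolerance, and the sample coordinates are assumptions for illustration, not the library's code.

```python
def aligned_left_edges(words, min_words_vertical=3, tolerance=1):
    """Return x positions shared by at least `min_words_vertical` left edges."""
    clusters = []  # each cluster holds x0 values that lie close together
    for x0 in sorted(w["x0"] for w in words):
        if clusters and x0 - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x0)
        else:
            clusters.append([x0])
    # Keep only alignments supported by enough words
    return [c[0] for c in clusters if len(c) >= min_words_vertical]


# Hypothetical words: five aligned at x=50, three at x=120, one stray at x=200
words = [{"x0": 50}] * 5 + [{"x0": 120}] * 3 + [{"x0": 200}]

print(aligned_left_edges(words, min_words_vertical=3))  # [50, 120]
print(aligned_left_edges(words, min_words_vertical=5))  # [50]
```

Raising the threshold drops alignments that only a few words happen to share, which is one plausible reason a spurious column can disappear when the value is increased.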

MalteHuener Jan 8, 2024
Author

Hi @jsvine ,

Thanks for your warm welcome, your reply, and the explanations, which I understand.

Now I have another PDF with almost the same table structure:
p6_1cropped_test-bwa.pdf

With

table_settings = {
    "text_keep_blank_chars": True,
    "vertical_strategy": "text",
}

I got this result, with two additional columns, and I don't know why:

And with

table_settings = {
    "text_keep_blank_chars": True,
    "vertical_strategy": "text",
    "min_words_vertical": 4,
}

only one unwanted column is left. I don't understand why, because there were no words in the column that has now disappeared with the min_words_vertical setting:

My Questions:

It seems that table extraction / structure finding works on the table content instead of the table header (first line). Can I change this somehow?
As I would like to extract many different tables like that (same structure, but different column width) I would like to do something like this:

Read the table header columns (sometimes 2 rows) with text_keep_blank_chars = True
Set the column lines midway between the words found in the table header, and read the whole table with these settings

Can you help me or give me a hint to get this running?

jsvine Jan 9, 2024
Maintainer

Hi @MalteHuener, notes below.

It seems that table extraction / structure finding works on the table content instead of the table header (first line). Can I change this somehow?

As it happens, PDFs don't have an internal concept of a table header, or even of a table; consequently, pdfplumber doesn't treat that text any differently than the rest. But I think the pattern you're seeing is due to the text in the table body aligning more consistently than it does in (what we perceive to be) the header.

That said, you could examine the page.rects and page.lines items to identify the positioning of the header, and then use page.crop((x0, top, x1, bottom)) to focus on just that part, and use the text there to identify the positioning of columns.
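
A minimal sketch of that idea, assuming the header is bounded by the two topmost horizontal ruling lines on the page (an assumption about this layout, not something pdfplumber detects for you). header_bbox and header_words are hypothetical helper names; page.lines, page.width, page.crop and extract_words(keep_blank_chars=True) are existing pdfplumber features.

```python
def header_bbox(lines, page_width, y_tolerance=1):
    """Bounding box between the two topmost horizontal lines."""
    tops = sorted(
        ln["top"] for ln in lines
        if abs(ln["bottom"] - ln["top"]) <= y_tolerance  # horizontal lines only
    )
    # Collapse lines that sit at (nearly) the same vertical position
    distinct = []
    for top in tops:
        if not distinct or top - distinct[-1] > y_tolerance:
            distinct.append(top)
    if len(distinct) < 2:
        raise ValueError("need at least two horizontal lines to bound the header")
    return (0, distinct[0], page_width, distinct[1])


def header_words(page, y_tolerance=1):
    """Words inside the assumed header region of a pdfplumber page."""
    bbox = header_bbox(page.lines, page.width, y_tolerance)
    return page.crop(bbox).extract_words(keep_blank_chars=True)
```

If the header is drawn with rectangles instead of lines, the same approach would need to read page.rects instead; checking the result visually with page.crop(bbox).to_image() is advisable.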

More generally, you could analyze the location of all text on the page to identify the areas where there's a bunch of whitespace (i.e., not text characters) to determine the location of the column divisions, and then use the "explicit_vertical_lines": [...] table settings.
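
The whitespace analysis could be sketched as follows. This is one possible approach under simple assumptions, not an official recipe: column_dividers and extract_with_dividers are hypothetical helpers, the min_gap value is arbitrary, and the words are expected to be dicts with "x0" and "x1" keys such as those returned by page.extract_words(). The "explicit" vertical strategy and "explicit_vertical_lines" are existing table settings.

```python
def column_dividers(words, min_gap=5):
    """Midpoints of horizontal gaps at least `min_gap` wide between words."""
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return []
    dividers = []
    right = spans[0][1]  # rightmost x covered by text so far
    for x0, x1 in spans[1:]:
        if x0 - right >= min_gap:
            dividers.append((right + x0) / 2)
        right = max(right, x1)
    return dividers


def extract_with_dividers(page, words, min_gap=5):
    """Extract a table using dividers derived from `words` as column lines."""
    left = min(w["x0"] for w in words)
    right = max(w["x1"] for w in words)
    settings = {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": [left, *column_dividers(words, min_gap), right],
        "horizontal_strategy": "lines",
    }
    return page.extract_table(settings)


# Hypothetical word boxes: three columns separated by clear gaps
sample = [
    {"x0": 10, "x1": 40},
    {"x0": 60, "x1": 90},
    {"x0": 62, "x1": 95},
    {"x0": 120, "x1": 150},
]
print(column_dividers(sample))  # [50.0, 107.5]
```

Passing only the header words yields dividers based on the header alone; passing all words on the page yields only those gaps that stay empty in every row.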

As I would like to extract many different tables like that (same structure, but different column width) I would like to do something like this:

If the tables do share the same structure / general graphical layout, this should be possible with the approach detailed above. (In theory.)

MalteHuener · 2024-01-09T11:01:48Z

MalteHuener
Jan 9, 2024
Author

@jsvine Another question: In my case I have a multi-page PDF with up to 50 table pages that can be arranged in different ways. If I previously figured out the correct table_settings for each table, how can I dynamically apply the correct table_settings? Do you have any other tips?

1 reply

jsvine Jan 9, 2024
Maintainer

I'm not sure I understand this question 100%, but if you've identified successful table settings, you can pass the same parameters to .extract_tables(...)/etc. on any other page.
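
One way to read the question is "different pages need different settings". Under that assumption, a small lookup keyed by page index could look like the sketch below. extract_all and the settings_for_page mapping are hypothetical; page.extract_tables(table_settings) is the existing pdfplumber method, and passing None falls back to the default settings.

```python
def extract_all(pages, settings_for_page, default_settings=None):
    """Return {page_index: tables}, using per-page settings where given."""
    results = {}
    for index, page in enumerate(pages):
        settings = settings_for_page.get(index, default_settings)
        results[index] = page.extract_tables(settings)
    return results


# Usage sketch (file name is a placeholder):
#
# import pdfplumber
# text_settings = {"vertical_strategy": "text", "text_keep_blank_chars": True}
# with pdfplumber.open("report.pdf") as pdf:
#     tables = extract_all(pdf.pages, {0: text_settings, 5: text_settings})
```

If the right settings depend on the page's content rather than its position, the dictionary lookup could be replaced by a function that inspects the page and returns the settings to use.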
