need help with table extraction settings #1071
MalteHuener
started this conversation in
Ask for help with specific PDFs
@jsvine Another question: I have a multi-page PDF with up to 50 pages of tables, and the tables can be laid out in different ways. Once I have worked out the correct table_settings for each table, how can I apply the right settings dynamically? Do you have any other tips?
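One possible way to do this, as a minimal sketch rather than an official recommendation: keep a mapping from page number to the settings that worked for that page, and fall back to a default everywhere else. The names `DEFAULT_SETTINGS`, `PAGE_SETTINGS`, `settings_for`, `extract_all` and `extract_file` are hypothetical, and the override for page 3 is made up. Only `pdfplumber.open`, `pdf.pages`, `page.page_number` and `page.extract_table` come from pdfplumber itself.

```python
# Default settings, used for any page without its own entry.
DEFAULT_SETTINGS = {"vertical_strategy": "text", "horizontal_strategy": "lines"}

# Hypothetical per-page overrides, keyed by 1-based page number.
PAGE_SETTINGS = {
    3: {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
}


def settings_for(page_number, overrides=PAGE_SETTINGS, default=DEFAULT_SETTINGS):
    """Return the settings recorded for a page, else the default."""
    return overrides.get(page_number, default)


def extract_all(pdf, overrides=PAGE_SETTINGS, default=DEFAULT_SETTINGS):
    """Map page number -> extracted table (None where no table is found)."""
    tables = {}
    for page in pdf.pages:
        settings = settings_for(page.page_number, overrides, default)
        tables[page.page_number] = page.extract_table(settings)
    return tables


def extract_file(path):
    """Open a PDF and extract one table per page with per-page settings."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return extract_all(pdf)
```

If the layout depends on the page content rather than the page number, the lookup in `settings_for` could be replaced by a function that inspects the page first (for example its text) and picks a settings dict accordingly.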
pdfplumber: 0.10.3
python: 3.9.0
OS: Mac
Hi there,
I'm taking my first steps with pdfplumber and need a little assistance.
I'm trying to extract the table from the following PDF:
1cropped_test-bwa.pdf
The table body is extracted correctly, but the (centered) header is not: the % characters are assigned to the wrong cells.
These are my table settings:
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "lines",
}
I also experimented with "snap_tolerance", "snap_x_tolerance", and "snap_y_tolerance", but without success.
Can you help me get the header extracted correctly?
Output of the first (header) line:
[['Bezeichnung',
'Mrz/2016 %',
'Ges.- %\nLeistg.',
'Ges.- %\nKosten',
'Pers.-\nKosten',
'Auf-\nschlag',
'Jan/2016 - %\nMrz/2016',
'Ges.- %\nLeistg.',
'Ges.- %\nKosten',
'Pers.-\nKosten',
'Auf-\nschlag']]
But it should be this:
[['Bezeichnung',
'Mrz/2016',
'%Ges.- \nLeistg.',
'%Ges.- \nKosten',
'%Pers.-\nKosten',
'Auf-\nschlag',
'Jan/2016 - \nMrz/2016',
'%Ges.- \nLeistg.',
'%Ges.- \nKosten',
'%Pers.-\nKosten',
'Auf-\nschlag'],
The % character is not assigned correctly. It seems that pdfplumber takes the table structure from the table body instead of from the first (header) line. Can I adjust or configure this somehow?
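One thing that might be worth trying, offered as an untested sketch and not as the maintainers' answer: instead of letting the "text" strategy infer column edges from the words on the page, pass the column boundaries explicitly via pdfplumber's "explicit" vertical strategy and its "explicit_vertical_lines" setting. The helper name `explicit_column_settings` and the x-coordinates below are hypothetical. The real coordinates would have to be read off the page, for example from `page.extract_words()` or from the image produced by `page.to_image().debug_tablefinder(...)`.

```python
def explicit_column_settings(x_positions):
    """Build table settings that use fixed column boundaries.

    x_positions: x-coordinates (in PDF points) of the column edges,
    including the table's left and right borders.
    """
    edges = sorted(set(x_positions))  # de-duplicate and order left to right
    if len(edges) < 2:
        raise ValueError("need at least two column edges")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": edges,
        "horizontal_strategy": "lines",
    }


# Hypothetical coordinates, for illustration only.
settings = explicit_column_settings([30, 150, 210, 250])
# table = page.extract_table(settings)
```

Because the header is centered, a single set of boundaries may not suit both the header and the body. In that case the header row could be cut out with `page.crop(...)` and extracted separately with its own boundaries.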