Help with table extraction settings when not all columns in table have a vertical line #1134
sfc-gh-svadakath started this conversation in Ask for help with specific PDFs
Replies: 1 comment
-
Hi @sfc-gh-svadakath, and thanks for your interest in Let us know if not.
-
Hi, I'd like to extract all five columns of the table in Schedule A (pages 5-36) of the attached PDF. The challenge is that pdfplumber only picks up the middle three columns, because the leftmost and rightmost columns have no enclosing vertical line. See screenshot. Is there a way to use table settings to extract all five columns of just the Schedule A table with pdfplumber? I'd appreciate any pointers. Thank you.
Here's my code snippet looking at just page 5.
```python
import pdfplumber
import pandas as pd

# Set display options to show all rows and columns
pd.set_option('display.max_rows', None)  # None means unlimited
pd.set_option('display.max_columns', None)  # None means unlimited

# Specify the path to your PDF file
pdf_file_path = "/Users/svadakath/Data/AlertPDF/RBC.pdf"

# Attempt 1: default table settings
pdf = pdfplumber.open(pdf_file_path)
# Note: pdf.pages is zero-indexed, so pages[5] is the sixth page of the file
table = pdf.pages[5].extract_table()
# Use the first row as the header and the remaining rows as data
pd.DataFrame(table[1:], columns=table[0])

# Attempt 2: explicit "lines" strategies (these are also the defaults)
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines"
}
pdf = pdfplumber.open(pdf_file_path)
table = pdf.pages[5].extract_table(table_settings)
pd.DataFrame(table[1:], columns=table[0])
```
RBC.pdf
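One direction that might be worth trying, sketched here under assumptions and not tested against RBC.pdf: pdfplumber's table settings accept an `explicit_vertical_lines` list, which is combined with whatever the chosen vertical strategy finds. Supplying the x-coordinates of the missing left and right table edges may let the outer two columns be detected. The helper names `build_table_settings` and `extract_with_outer_edges`, and the coordinate values in the usage comment, are hypothetical; the real coordinates would have to be read off the page (for example from a debug image or from word positions).

```python
def build_table_settings(left_x, right_x):
    """Build pdfplumber table settings that add two explicit vertical lines.

    left_x and right_x are assumed to be the x-coordinates (in PDF points)
    of the table's missing outer edges; they must be measured on the page.
    """
    return {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        # Explicit lines are used in addition to the lines found on the page
        "explicit_vertical_lines": [left_x, right_x],
    }


def extract_with_outer_edges(pdf_path, page_index, left_x, right_x):
    """Extract one table from a page, adding the outer edges explicitly."""
    import pdfplumber  # imported here so the settings helper has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]  # zero-indexed
        return page.extract_table(build_table_settings(left_x, right_x))


# Hypothetical usage; 40 and 570 are placeholder coordinates, not measured values:
# table = extract_with_outer_edges("RBC.pdf", 5, 40, 570)
```

If measuring the edges is impractical, setting `"vertical_strategy": "text"` is another documented option; it infers column boundaries from the alignment of words instead of drawn lines, though it may need tuning on a dense schedule.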