How to adjust line thickness (aka parser seeing empty cells)? #1092

enrac5 · 2024-02-08T18:55:23Z


I'm trying to parse a PDF (see attached: Test_R1.pdf) with the following code:

    import pdfplumber

    def process_PDF(pdf_path: str):
        # Open the PDF and print every row of every table found on each page.
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                tables = page.extract_tables()
                for table in tables:
                    for row in table:
                        print(row)

I get a printout like this:

['COMBINED CONTINUITY & DIALOGUE', None, None, None, None, 'MASTER ENGLISH SUBTITLE/SPOTTING LIST', None, None, None, None]
['Shot', 'Footage/\nTimecode', None, 'Shot Description/Dialogue', None, 'Title', 'Start', 'End', 'Dur.', 'Subtitle/Spotting']
['', None, None, None, None, None, None, None, None, None]

Notice the second row: there's a 'None' entry after 'Footage/\nTimecode'. The table debugger shows this image (attached).

That image doesn't immediately explain why I'm seeing this output. Does anyone have any ideas about what I'm doing wrong here? For what it's worth, other PDFs (all Word docs printed to PDF) don't seem to behave like this.

jsvine · 2024-02-10T23:25:36Z

Maintainer

Hi @enrac5, and thanks for your interest in this library. Semi-regular tables like this can be a bit difficult to work with at first. The key to understanding the output is that each row array contains one entry for every horizontal cell "position" in the table, whether or not that row has a cell starting at that position.
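To make the "one entry per position" idea concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not pdfplumber's actual implementation. The function name `build_rows` and all coordinates and cell texts are hypothetical, chosen so that the header border sits a few points to the left of the body border.

```python
# Toy model only: illustrates "one entry per horizontal position".
# The coordinates and cell texts below are made up for illustration.

def build_rows(rows_of_cells):
    """Turn rows of (x0, x1, text) cells into equal-length row arrays.

    Every distinct left edge found anywhere in the table becomes a slot.
    A row gets None in any slot where none of its cells starts.
    """
    lefts = sorted({x0 for cells in rows_of_cells for x0, _x1, _text in cells})
    result = []
    for cells in rows_of_cells:
        by_left = {x0: text for x0, _x1, text in cells}
        result.append([by_left.get(x) for x in lefts])
    return result

# Header border at x=145, body border at x=150 (hypothetical offset).
header = [(50, 100, "Shot"), (100, 145, "Footage/\nTimecode"),
          (145, 300, "Shot Description/Dialogue")]
body = [(50, 100, "1"), (100, 150, "00:00:01"), (150, 300, "Opening title")]

table = build_rows([header, body])
for row in table:
    print(row)
```

Because the two rows disagree about where the third cell starts, the table ends up with four slots instead of three, and each row shows `None` in the slot it does not use.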

In this case, the right border of the Footage/Timecode header cell is slightly to the left of the right border of the timestamped cells at the bottom of the page, so the parser sees two distinct horizontal positions where you would expect one.
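One thing that may be worth trying is widening the horizontal snapping tolerance, so that nearly aligned vertical borders are treated as a single position. The sketch below first shows the idea with a toy function (`snap_positions`, hypothetical, not pdfplumber's code) and then passes pdfplumber's `snap_x_tolerance` table setting to `extract_tables`. The wrapper `extract_with_wider_snap`, the value 6, and the coordinates are assumptions. Whether this helps for the attached PDF depends on how far apart the two borders really are, and too large a tolerance can merge columns that are genuinely separate.

```python
def snap_positions(xs, tolerance):
    """Toy snapping: group positions whose sorted neighbours differ by at
    most `tolerance`, and map each position to the mean of its group."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return {x: sum(c) / len(c) for c in clusters for x in c}


def extract_with_wider_snap(pdf_path, snap_x_tolerance=6):
    """Collect all table rows using a wider horizontal snap tolerance.

    The default of 6 is an arbitrary starting point, not a recommended value.
    """
    import pdfplumber  # imported here so the toy function above runs without it

    settings = {"snap_x_tolerance": snap_x_tolerance}
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables(table_settings=settings):
                rows.extend(table)
    return rows


# Hypothetical vertical border positions, with two borders 5 points apart.
edges = [50, 100, 145, 150, 300]
tight = snap_positions(edges, tolerance=3)  # 145 and 150 stay separate
loose = snap_positions(edges, tolerance=6)  # 145 and 150 merge at 147.5
print(sorted(set(tight.values())))
print(sorted(set(loose.values())))
```

If the extra position disappears after widening the tolerance, the `None` entries it caused should disappear as well. It is worth checking the result with the table debugger before relying on it.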

It's not ideal, and something I'd like to make more intuitive / better-structured in future versions of pdfplumber.

1 reply

enrac5 Feb 28, 2024
Author

Hi there, thanks for the response @jsvine, I was able to work around the issue by manipulating the Word doc this PDF came from, so I think I'm okay for now. Had nasty flashbacks to my desktop publishing days (if you remember QuarkXPress, you'll know my pain).
