How to extract structured drilling report data from PDF into JSON using Python? #1351

PentapatiAdarsh · 2025-12-07T19:07:14Z


I’m building a RAG-style application and I want to extract data from PDF reports into a structured JSON format so I can send it directly to an LLM later, without using embeddings.

Right now I’m:

- describing the PDF layout in a YAML pattern,
- using pdfplumber to extract fields/tables according to that pattern,
- saving the result as JSON.
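
For illustration, a pipeline of this shape might look roughly like the sketch below. It is not the actual project code: the `TEMPLATE` keys, the bounding boxes, and the helper names `extract_fields` and `pdf_to_json` are hypothetical, and the template is written as the plain dict that loading a YAML layout file would typically produce. It relies only on pdfplumber's `open`, `Page.crop`, and `extract_text`.

```python
import json

# Hypothetical template, shaped like the dict a YAML layout file would load into.
# Field names and coordinates are placeholders, not values from a real report.
TEMPLATE = {
    "fields": {
        "well_name": {"page": 0, "bbox": [40, 60, 300, 80]},
        "report_date": {"page": 0, "bbox": [320, 60, 560, 80]},
    }
}


def extract_fields(pages, template):
    """Crop each templated region and return {field_name: text or None}."""
    record = {}
    for name, spec in template["fields"].items():
        page = pages[spec["page"]]
        # bbox order in pdfplumber is (x0, top, x1, bottom)
        text = page.crop(tuple(spec["bbox"])).extract_text() or ""
        record[name] = text.strip() or None
    return record


def pdf_to_json(pdf_path, template, out_path):
    """Open the PDF, extract the templated fields, and write them as JSON."""
    import pdfplumber  # third-party; imported here so extract_fields has no hard dependency

    with pdfplumber.open(pdf_path) as pdf:
        record = extract_fields(pdf.pages, template)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, ensure_ascii=False)
    return record
```

Empty regions are stored as `None` rather than an empty string, so a field that was not found stays distinguishable from a field that is genuinely blank.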

On complex reports (example screenshot/page attached), I’m running into issues keeping the extraction 100% accurate and stable: mis-detected table rows, shifted columns, and occasional missing fields.
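
To make these failure modes concrete, a check along the following lines could flag a bad extraction before it reaches the LLM. This is only a sketch under assumptions: `validate_record`, the `operations` table key, and the expected column count are hypothetical names and values, and the check catches missing fields and rows with the wrong number of cells, not every kind of column shift.

```python
def validate_record(record, required_fields, table_key, expected_columns):
    """Return a list of human-readable problems found in an extracted record."""
    problems = []

    # Missing or empty scalar fields
    for name in required_fields:
        if not record.get(name):
            problems.append(f"missing field: {name}")

    # Table rows whose cell count differs from the template, or that are empty
    for i, row in enumerate(record.get(table_key) or []):
        if len(row) != expected_columns:
            problems.append(
                f"{table_key} row {i}: expected {expected_columns} cells, got {len(row)}"
            )
        elif all(cell in (None, "") for cell in row):
            problems.append(f"{table_key} row {i}: empty row")

    return problems
```

An empty list means the record passed these checks. Anything else can be logged or used to reject the page, so that silent errors do not end up in the JSON.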

My questions:

1. Are there better approaches or libraries for highly reliable, template-based PDF → JSON extraction?
2. Is there a recommended way to combine pdfplumber with layout analysis (or another tool) to make this more robust and automatable for RAG ingestion?

Constraints:

- Reports follow a fixed layout (like the attached Daily Drilling Report).
- I’d like something that can run automatically in a pipeline (no manual labeling).

Any patterns, tools, or example code for turning a fixed-format PDF like this into consistent JSON would be greatly appreciated.
report2.pdf
