How to extract structured drilling report data from PDF into JSON using Python? #1351
Unanswered
PentapatiAdarsh asked this question in Q&A
Replies: 0 comments
I’m building a RAG-style application and I want to extract data from PDF reports into a structured JSON format so I can send it directly to an LLM later, without using embeddings.
Right now I’m:
- describing the PDF layout in a YAML pattern,
- using pdfplumber to extract fields/tables according to that pattern,
- saving the result as JSON.
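A minimal sketch of this kind of pipeline, for illustration only and not the actual code from the project. It assumes the YAML pattern has already been parsed into a dict; the field names, table name, column names, and box coordinates in `TEMPLATE` are all hypothetical, as are the helper names `extract_page` and `pdf_to_json`. The pdfplumber calls used are `pdfplumber.open`, `page.crop`, `extract_text`, and `extract_table`.

```python
import json

# Hypothetical template standing in for the parsed YAML pattern.
# Boxes are (x0, top, x1, bottom) in PDF points; the values are made up.
TEMPLATE = {
    "fields": {
        "well_name": (40, 60, 300, 80),
        "report_date": (320, 60, 560, 80),
    },
    "tables": {
        "operations": {
            "bbox": (40, 200, 560, 500),
            "columns": ["from", "to", "hours", "description"],
        },
    },
}


def extract_page(page, template):
    """Extract fields and tables from one pdfplumber page using fixed boxes."""
    record = {}
    for name, bbox in template["fields"].items():
        # extract_text() may return None or an empty string for an empty region.
        text = page.crop(bbox).extract_text() or ""
        record[name] = text.strip() or None
    for name, spec in template["tables"].items():
        # extract_table() returns a list of rows, or None if no table is found.
        rows = page.crop(spec["bbox"]).extract_table() or []
        record[name] = [dict(zip(spec["columns"], row)) for row in rows]
    return record


def pdf_to_json(path, template=TEMPLATE, page_index=0):
    """Open a PDF and return the extracted record as a JSON string."""
    import pdfplumber  # imported lazily so extract_page can be tested without it

    with pdfplumber.open(path) as pdf:
        record = extract_page(pdf.pages[page_index], template)
    return json.dumps(record, indent=2)
```

Calling `pdf_to_json("report2.pdf")` would run this against the attached report, but the boxes above would first have to be replaced with coordinates measured from the real layout. Note that `zip` silently truncates rows that have too many or too few cells, which is one way shifted columns can go unnoticed.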
On complex reports (example screenshot/page attached), I’m running into issues keeping the extraction 100% accurate and stable: mis-detected table rows, shifted columns, and occasional missing fields.
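To make these failure modes concrete, here is a rough sketch of a post-extraction check that flags missing fields, rows with the wrong number of cells, and cells whose content does not look like it belongs in its column. Everything in it is an assumption for illustration: the field and column names, the regular expressions in `COLUMN_PATTERNS`, and the helper name `validate_record`. It works on plain Python data, so it does not depend on any particular extraction library.

```python
import re

# Hypothetical per-column patterns; columns without a pattern are not checked.
COLUMN_PATTERNS = {
    "from": re.compile(r"^\d{2}:\d{2}$"),
    "to": re.compile(r"^\d{2}:\d{2}$"),
    "hours": re.compile(r"^\d+(\.\d+)?$"),
}


def validate_record(record, required_fields, table_name, columns):
    """Return a list of human-readable problems found in one extracted record."""
    problems = []

    # Missing fields: absent key, None, or empty string.
    for field in required_fields:
        if not record.get(field):
            problems.append(f"missing field: {field}")

    for i, row in enumerate(record.get(table_name) or []):
        # Mis-detected rows: wrong number of cells for this table.
        if len(row) != len(columns):
            problems.append(
                f"row {i}: expected {len(columns)} cells, got {len(row)}"
            )
            continue
        # Shifted columns: a cell that does not match its column's pattern.
        for col, cell in zip(columns, row):
            pattern = COLUMN_PATTERNS.get(col)
            if pattern and not pattern.match((cell or "").strip()):
                problems.append(
                    f"row {i}: column '{col}' has unexpected value {cell!r}"
                )
    return problems
```

A pipeline could refuse to emit JSON, or route the page to a review queue, whenever the returned list is non-empty, so that silent extraction errors never reach the LLM.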
My questions:
1. Are there better approaches or libraries for highly reliable, template-based PDF → JSON extraction?
2. Is there a recommended way to combine pdfplumber with layout analysis (or another tool) to make this more robust and automatable for RAG ingestion?
Constraints:
- Reports follow a fixed layout (like the attached Daily Drilling Report).
- I’d like something that can run automatically in a pipeline (no manual labeling).
Any patterns, tools, or example code for turning a fixed-format PDF like this into consistent JSON would be greatly appreciated.
report2.pdf