This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Command line option that returns detected areas #11

Open
soupgrey opened this issue Jun 20, 2013 · 5 comments

Comments

@soupgrey
Contributor

Feature request.

I'd love to see a command line option that reports the rectangles found by TableGuesser: a "dry run mode" showing which portion of the PDF tabula-extractor will process.

@jeremybmerrill
Member

Curious, @soupgrey, would you parse the output to use it programmatically, or just check it yourself before running the admittedly time-consuming process?

In either case, can you suggest how sample output would look? e.g., for human-readable output, are you thinking something like this?

 [x, y, w, h]
 Found 2 rectangles on page 1:
 [100, 100, 400, 400]
 [100, 600, 100, 100]
 Found 1 rectangle on page 2:
 [100, 100, 400, 400]

@soupgrey
Contributor Author

I would like to use it for debugging strange tabula output. Sometimes PDF table extraction does not work perfectly (in most cases when dealing with multi-line cells or poor-quality PDF files).

I would like to review what TableGuesser treats as a table and, if it made a bad choice, reprocess the PDF with a manually selected area. Then I could see how much I can trust TableGuesser to automate PDF processing :)

Preferred output would be CSV format like:

page number, x, y, w, h
or
page number, [x, y, w, h]
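
As an illustration of the first CSV layout, here is a minimal Ruby sketch. It assumes the detected rectangles are already available as simple records; `Area` and `areas_to_csv` are hypothetical names made up for this example and are not part of tabula-extractor. The sample values are the ones from the human-readable example earlier in the thread.

```ruby
# Hypothetical stand-in for a detected rectangle; not a tabula-extractor class.
Area = Struct.new(:page, :x, :y, :w, :h)

# Render the areas as CSV text: a header line, then one line per rectangle.
# All fields are numeric, so no quoting or escaping is needed here.
def areas_to_csv(areas)
  lines = ["page,x,y,w,h"]
  areas.each { |a| lines << [a.page, a.x, a.y, a.w, a.h].join(",") }
  lines.join("\n") + "\n"
end

# Sample rectangles taken from the human-readable example in this thread.
areas = [
  Area.new(1, 100, 100, 400, 400),
  Area.new(1, 100, 600, 100, 100),
  Area.new(2, 100, 100, 400, 400)
]

puts areas_to_csv(areas)
```

One row per rectangle, with the page number in the first column, keeps the output easy to filter or feed into another tool.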

BTW - Great software :)

@jeremybmerrill
Member

Thanks. It's definitely the case that table detection is imperfect. We're also definitely working on it, so if you get errors, debug output is helpful.

Since this isn't a bug, I'm not going to fix it right now, but this is an easy and doable feature request.

@soupgrey
Contributor Author

Another debugging option would be useful for me. It would be great if tabula could return the location of the extracted row text. Something like:

column 1 text, column 2 text, page_number, [x, y, w, h]

With this information it would be possible to test the coverage of an area or page extraction, that is, to check how much of the data was recognized and extracted.
For example:
Manually - show the user an image of the PDF page with an overlay marking which regions were extracted. The user could then verify whether any important rows were left unrecognized on the page.
Automatically - on an image of the PDF page, clear all extracted row areas (e.g. fill them with white rectangles) and check whether any groups of non-white pixels remain in the extraction area (i.e. the rectangle passed on the command line or detected by TableGuesser).
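
The automatic check can be sketched in plain Ruby under some loud assumptions: the page has already been rasterized into a grid of booleans (`true` means a non-white pixel), and the row rectangles are in the same pixel coordinates. `Rect` and `unextracted_pixels` are hypothetical names for this example and not tabula-extractor APIs. The sketch counts leftover pixels instead of grouping them, which is a simplification of the idea above.

```ruby
# Hypothetical rectangle in pixel coordinates; not a tabula-extractor class.
Rect = Struct.new(:x, :y, :w, :h) do
  # True when the pixel (px, py) lies inside this rectangle.
  def cover?(px, py)
    px >= x && px < x + w && py >= y && py < y + h
  end
end

# page: array of pixel rows, each an array of booleans (true = non-white).
# area: the extraction area; rows: rectangles of the extracted rows.
# Returns how many non-white pixels inside `area` are not covered by any row,
# which is equivalent to whiting out the rows and counting what is left.
def unextracted_pixels(page, area, rows)
  count = 0
  (area.y...(area.y + area.h)).each do |py|
    (area.x...(area.x + area.w)).each do |px|
      next unless page.dig(py, px)                 # white or outside the image
      count += 1 unless rows.any? { |r| r.cover?(px, py) }
    end
  end
  count
end

# Tiny made-up page, 6 pixels wide and 4 pixels high, with three inked pixels.
page = Array.new(4) { Array.new(6, false) }
page[1][1] = true
page[1][2] = true
page[3][4] = true

area = Rect.new(0, 0, 6, 4)     # whole page as the extraction area
rows = [Rect.new(0, 1, 4, 1)]   # one extracted row covering x 0..3 on y = 1

leftover = unextracted_pixels(page, area, rows)
puts "non-white pixels outside extracted rows: #{leftover}"
```

A result of zero would suggest everything inked in the area was covered by some extracted row; a real implementation would probably also need a tolerance for ruling lines and scan noise.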

I don't know how reasonable this sounds to you, but for me it could improve automation and verification.
I would like to be able to prove that an extracted row exists in the original data source. For example, a user browsing the extracted rows who wants to see the original context could get an image of the original PDF page with that row highlighted.

noahpryor pushed a commit to noahpryor/tabula-extractor that referenced this issue Jul 16, 2013
Add an empty dot file to pdfs directory
@jeremybmerrill
Member

@soupgrey, you should check out the debug output in Cell (with debug level set to SUPERDEBUG). https://github.com/jazzido/tabula-extractor/blob/pre07/lib/tabula/entities/cell.rb

It should do some of what I think you're asking for. This output isn't available for all extractions (just the ones using the "spreadsheet" method), but eventually it should be. I'd love to hear your feedback.
