Command line option that returns detected areas #11

soupgrey · 2013-06-20T07:09:20Z

Feature request.

I'd love to see a command-line option that reports the rectangles found by TableGuesser: a "dry run" mode showing which portions of the PDF tabula-extractor will process.

jeremybmerrill · 2013-06-20T14:34:49Z

Curious, @soupgrey: would you parse the output programmatically, or just check it yourself before running the admittedly time-consuming process?

In either case, can you suggest what sample output would look like? E.g., for human-readable output, are you thinking of something like this?

 [x, y, w, h]
 Found 2 rectangles on page 1:
 [100, 100, 400, 400]
 [100, 600, 100, 100]
 Found 1 rectangle on page 2:
 [100, 100, 400, 400]

soupgrey · 2013-06-20T14:47:53Z

I would like to use it for debugging strange tabula output. Sometimes PDF table extraction does not work perfectly (in most cases when dealing with multi-line cells or poor-quality PDF files).

I would like to review what TableGuesser treats as a table and, if it is a bad choice, reprocess the PDF with a manually selected area. Then I could see how much I can trust TableGuesser to automate PDF processing :)

Preferred output would be CSV format, like:

page number, x, y, w, h
or
page number, [x, y, w, h]
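
For illustration, here is a minimal Python sketch of the first CSV layout (one `page, x, y, w, h` row per rectangle). It is not tabula-extractor code: `areas_to_csv` and the `detected` dictionary are hypothetical names, and the coordinates simply reuse the numbers from the sample output earlier in the thread.

```python
import csv
import io


def areas_to_csv(areas_by_page):
    """Serialize {page_number: [(x, y, w, h), ...]} into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "x", "y", "w", "h"])
    # Emit pages in ascending order, one line per detected rectangle.
    for page in sorted(areas_by_page):
        for x, y, w, h in areas_by_page[page]:
            writer.writerow([page, x, y, w, h])
    return buffer.getvalue()


# Hypothetical detections (assumed data, not real TableGuesser output).
detected = {
    1: [(100, 100, 400, 400), (100, 600, 100, 100)],
    2: [(100, 100, 400, 400)],
}
print(areas_to_csv(detected), end="")
```

A flat row per rectangle keeps the output easy to load into a spreadsheet or to feed back as a manually selected area.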

BTW - Great software :)

jeremybmerrill · 2013-06-20T15:12:15Z

Thanks. It's definitely the case that table detection is imperfect. We're also definitely working on it, so if you get errors, debug output is helpful.

Since this isn't a bug, I'm not going to fix it right now, but this is an easy and doable feature request.

soupgrey · 2013-06-25T08:47:32Z

Another debugging option would be useful for me. It would be great if tabula could return the location of each extracted row's text. Something like:

column 1 text, column 2 text, page_number, [x, y, w, h]

With this information it would be possible to test the extraction coverage of an area or page, i.e. to check how much of the data was recognized and extracted.
For example:
Manually - show the user an image of the PDF page with an overlay marking which regions were extracted. The user could then verify whether any important rows on the page were left unrecognized.
Automatically - on an image of the PDF page, clear all extracted row areas (e.g. fill them with white rectangles) and check whether any groups of non-white pixels remain in the extraction area (i.e. the rectangle passed on the command line or detected by TableGuesser).
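
The automatic check could be sketched roughly as follows. This is only an illustration under stated assumptions: the page is a plain 2D list of grayscale values (255 = white) rather than a real rendered PDF image, rectangles are `(x, y, w, h)` in pixel coordinates, and `uncovered_pixels` is a hypothetical helper, not part of tabula. It reports individual leftover pixels; grouping them into connected regions is left out.

```python
def uncovered_pixels(page, area, row_boxes, white=255):
    """Return (x, y) of non-white pixels inside `area` not covered by any row box.

    page: 2D list of grayscale values indexed as page[y][x]
    area, row_boxes: (x, y, w, h) rectangles in pixel coordinates
    """
    # Work on a copy so the caller's image is left untouched.
    masked = [row[:] for row in page]
    height = len(masked)
    width = len(masked[0]) if masked else 0

    # Paint every extracted row box white, clipped to the image bounds.
    for bx, by, bw, bh in row_boxes:
        for y in range(max(by, 0), min(by + bh, height)):
            for x in range(max(bx, 0), min(bx + bw, width)):
                masked[y][x] = white

    # Anything still non-white inside the extraction area was not extracted.
    ax, ay, aw, ah = area
    return [
        (x, y)
        for y in range(max(ay, 0), min(ay + ah, height))
        for x in range(max(ax, 0), min(ax + aw, width))
        if masked[y][x] != white
    ]


# Tiny 6x4 "page" (assumed data): 0 is ink, 255 is white.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,   0,   0,   0, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [255,   0, 255, 255, 255, 255],
]
area = (0, 0, 6, 4)          # whole page as the extraction area
rows = [(1, 1, 3, 1)]        # one extracted row covering the ink on line 1
leftover = uncovered_pixels(page, area, rows)
print(leftover)
```

An empty result would suggest that every inked pixel in the area belongs to some extracted row; a non-empty one points at content that extraction may have missed.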

I don't know how reasonable this sounds to you, but for me it could improve automation and verification.
I would like to be able to prove that an extracted row exists in the original data source. In a scenario where a user browses through extracted rows and wants to see the original context, they could get an image of the original PDF page with that row highlighted.

jeremybmerrill · 2014-01-07T04:13:56Z

@soupgrey, you should check out the debug output in Cell (with debug level set to SUPERDEBUG). https://github.com/jazzido/tabula-extractor/blob/pre07/lib/tabula/entities/cell.rb

It should do some of what I think you're asking for. This output isn't available for all extractions (just "spreadsheet" method ones), but eventually it should be. I'd love to hear your feedback.

soupgrey mentioned this issue Jun 21, 2013

Command line parameter --debug for printing guessed areas #12

Merged

noahpryor pushed a commit to noahpryor/tabula-extractor that referenced this issue Jul 16, 2013

Merge pull request tabulapdf#11 from jazzido/add-empty-pdfs-dir

8e96039

Add an empty dot file to pdfs directory
