This repository has been archived by the owner on Jan 20, 2021. It is now read-only.

Helping tabula find the top of a table - column heading cribs? #112

Open
psychemedia opened this issue May 24, 2016 · 2 comments


@psychemedia

When parsing large documents with tables placed in arbitrary locations on a page, I wonder if it would be useful to help Tabula get its eye in on the location of a table by giving it one or more keywords that you expect to see, or require, in the table's column headings?

So, for example, we might provide a set of required heading tokens (Date, Region) that must all appear in the set of tokens generated from the words in the guessed-at column headings, to help identify a particular table or sort of table. Alternatively, we might provide a set of possible heading tokens that we know often appear in the headings of tables we want to extract, while still being open to Tabula extracting other things it thinks are tables.
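A rough sketch of how such a check might look, assuming the guessed-at heading cells are already available as a list of strings. The function names `tokenise` and `heading_match` are made up for illustration and are not part of Tabula:

```python
import re


def tokenise(cells):
    """Return the set of lower-case word tokens found in the heading cells."""
    tokens = set()
    for cell in cells:
        tokens.update(re.findall(r"[a-z0-9]+", cell.lower()))
    return tokens


def heading_match(cells, required=(), possible=()):
    """Hypothetical helper, not a Tabula API.

    Returns None if any required token is missing from the headings,
    otherwise the number of 'possible' tokens that were found.
    """
    tokens = tokenise(cells)
    required = {t.lower() for t in required}
    possible = {t.lower() for t in possible}
    if not required <= tokens:
        return None
    return len(possible & tokens)


# Both required tokens are present, and one of the two possible tokens is.
score = heading_match(
    ["Date of sale", "Region", "Amount (GBP)"],
    required=["Date", "Region"],
    possible=["Amount", "Total"],
)
```

A candidate table whose headings give `None` would be rejected; among the rest, a higher count suggests a closer match to the sort of table wanted.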

@jeremybmerrill
Member

That's a great idea...

@Darkvater

I wonder if this has gotten anywhere? I'm writing a bank-statement parser, but the table detection can be very fickle. In essence, I extract the whole page, and unfortunately Tabula doesn't always find the tables; depending on the contents, it groups one or more columns together, which makes the data really difficult to work with.

I was just thinking of doing something very similar in my app to what is suggested above:

  • extract everything from the page
  • find the keywords that mark the start/end of a table
  • rerun the table extraction process using just those coordinates as start/end (hoping that it will now work). As I only have tables that span the whole page, I will only be using the Y-coordinates.

I have empirically verified that this works with a few examples using the Tabula UI, so I think I will give it a try. However, if it already exists, or people have better ideas, I would be delighted to hear.
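The three steps listed above could be sketched roughly as follows, assuming the first pass yields each word together with its vertical position. The dict shape (`text` and `top` keys) and the function name `table_area` are assumptions for illustration, not something Tabula provides:

```python
def table_area(words, start_keywords, end_keywords, page_width, page_height):
    """Hypothetical helper: derive a full-width area from marker keywords.

    `words` is a list of dicts like {"text": "Date", "top": 120.0}, where
    `top` is the distance from the top of the page. Returns a
    (top, left, bottom, right) tuple, or None if no start keyword is found.
    """
    start = {k.lower() for k in start_keywords}
    end = {k.lower() for k in end_keywords}
    top = None
    # Walk the words from the top of the page downwards.
    for word in sorted(words, key=lambda w: w["top"]):
        text = word["text"].lower()
        if top is None:
            if text in start:
                top = word["top"]
        elif text in end and word["top"] > top:
            # Stop at the first end marker below the start marker.
            return (top, 0.0, word["top"], page_width)
    if top is None:
        return None
    # No end marker: assume the table runs to the bottom of the page.
    return (top, 0.0, page_height, page_width)


words = [
    {"text": "Statement", "top": 40.0},
    {"text": "Date", "top": 120.0},
    {"text": "Balance", "top": 120.0},
    {"text": "Total", "top": 600.0},
]
area = table_area(words, ["Date"], ["Total"], page_width=595.0, page_height=842.0)
```

The extraction would then be rerun on `area` only. The tuple is in top, left, bottom, right order, which as far as I know is what the area option of tabula-java and tabula-py expects, but that is worth checking against the version in use.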
