This repository has been archived by the owner on Jan 20, 2021. It is now read-only.
Helping tabula find the top of a table - column heading cribs? #112
Comments
That's a great idea...
I wonder if this has gotten anywhere? I'm writing a bank-statement parser, but the table detection can be very fickle. In essence I extract the whole page, and unfortunately Tabula doesn't always find the tables, so depending on the contents it will group one or more columns together, which makes the data really difficult to work with. I was just thinking of doing something very similar in my app, as suggested above.
I have empirically verified that this works with a few examples using the Tabula UI, so I think I will give it a try. However, if it already exists, or people have better ideas, I would be delighted to hear.
When parsing large documents with tables placed in arbitrary locations on a page, I wonder if it would be useful to help Tabula get its eye in on the location of a table by giving it one or more keywords that you expect to see, or require, in the table's column headings?
So for example, we might provide a set of required heading tokens (Date, Region) that must appear in a tokenised set generated from the words in the guessed-at column headings, to help identify a particular table or sort of table. Or we might provide a set of possible heading tokens that we know often appear in the headings of tables we want to extract, while remaining open to Tabula extracting other things it thinks are tables?
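To make the idea concrete, here is a minimal sketch of the token matching, done as a post-processing step in plain Python rather than inside Tabula. It assumes the page has already been extracted into rows of cell strings by whatever tool is in use. The names `tokenise`, `find_header_row`, `required`, `optional` and `min_optional` are all hypothetical, not part of any Tabula API, and the sample rows are made up for illustration.

```python
import re


def tokenise(cells):
    """Return the set of lower-case word tokens found in a row of cells."""
    tokens = set()
    for cell in cells:
        # Treat missing cells (None) as empty strings.
        tokens.update(re.findall(r"[a-z0-9]+", (cell or "").lower()))
    return tokens


def find_header_row(rows, required=(), optional=(), min_optional=0):
    """Return the index of the first row that looks like the table heading.

    A row qualifies when it contains every token in `required` and at
    least `min_optional` of the tokens in `optional`. Tokens are single
    words, compared case-insensitively. Returns None if no row qualifies.
    """
    required = {token.lower() for token in required}
    optional = {token.lower() for token in optional}
    for index, row in enumerate(rows):
        tokens = tokenise(row)
        if required <= tokens and len(optional & tokens) >= min_optional:
            return index
    return None


# Made-up rows standing in for whatever the extractor returned.
rows = [
    ["Statement of account", ""],
    ["Date", "Region", "Amount (GBP)"],
    ["2020-01-03", "North", "12.50"],
]

header_index = find_header_row(rows, required=("Date", "Region"))
```

With a heading row located, everything above it could be discarded, or its position used to narrow the area handed back to the extractor. Note that with no required tokens and `min_optional=0` the first row always matches, so at least one of the two constraints should be set.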