Regression against v1.0.2: scientific notation and text element positioning #526
Comments
The character encoding problems are an Excel issue, not a Tabula issue. For the merged columns, you may need to explicitly specify an extraction method and/or make sure your extraction regions are identical to those in the GUI version, for instance by taking them from the GUI's bash file output.
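For anyone scripting this, here is a minimal sketch of pinning both the extraction method and the area explicitly when calling the CLI. The helper name `build_tabula_command`, the jar path, the file names, and the coordinates are all placeholders, not values from this issue. The long options used (`--pages`, `--area`, `--stream`/`--lattice`, `--format`, `--outfile`) are the ones the tabula-java CLI documents, but it is worth checking them against `--help` for the release you are running.

```python
from typing import List, Sequence


def build_tabula_command(
    jar_path: str,
    pdf_path: str,
    page: int,
    area: Sequence[float],
    method: str = "stream",
    out_path: str = "out.csv",
) -> List[str]:
    """Build an argument list for tabula-java with an explicit method and area.

    `area` is (top, left, bottom, right) in PDF points, the order the
    --area option expects.
    """
    if len(area) != 4:
        raise ValueError("area must be (top, left, bottom, right)")
    if method not in ("stream", "lattice"):
        raise ValueError("method must be 'stream' or 'lattice'")
    # Fixed two-decimal formatting keeps the area string stable between runs.
    area_arg = ",".join(f"{value:.2f}" for value in area)
    return [
        "java", "-jar", jar_path,
        "--pages", str(page),
        "--area", area_arg,
        f"--{method}",
        "--format", "CSV",
        "--outfile", out_path,
        pdf_path,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building the command in one place makes it easier to confirm that the CLI run really uses the same area and method as the GUI.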
I should have clarified more. I am using the stream method in both cases; the lattice method outputs no data at all. Both runs also use the exact same areas. As for the characters, I am not worried about how Excel parses the data, sorry for the miscommunication. I am more concerned that tabula-java is not fully picking up the scientific notation and is merging columns, as I mentioned.
At the end of the day, the GUI is a front-end for the command-line version. (And the CLI version exists for automated pipelines, which I've implemented many of. So, this oughta work.) I can't really offer much of a theory on the combined columns without seeing the PDF (at least a screenshot of the table), but in general with the stream method, columns get combined when there's some text that spans the two columns. Often headers are the culprit (or footnotes). You might try fuzzing the coordinates a little bit to see if something's being erroneously included. As for the scientific notation, I don't have any theories, again without seeing more. Have you verified that the characters are really absent by opening the CSV in a text editor, or in Google Sheets, which copes with Unicode in CSVs better than Excel (or, to be precise, Mac Excel)?
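If a text editor is inconvenient, the same check can be scripted. This is a rough sketch, assuming the CSV is UTF-8. The function name `inspect_csv_text` and the regular expression are my own, and the pattern only covers common spellings of scientific notation (`3.1E-7`, `2.5 × 10−6`), so treat a miss as inconclusive rather than as proof that the characters are absent.

```python
import csv
import io
import re
from typing import Dict, List

# Matches forms like "3.1E-7" or "2.5 × 10−6" (the latter with a Unicode minus).
SCI_NOTATION = re.compile(
    r"[-+]?\d+(?:\.\d+)?\s*(?:[eE]|[×x]\s*10)\s*[-+−]?\d+"
)


def inspect_csv_text(text: str) -> Dict[str, List[str]]:
    """Return cells containing non-ASCII characters and cells that look
    like scientific notation, taken from CSV text."""
    non_ascii: List[str] = []
    scientific: List[str] = []
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            if any(ord(ch) > 127 for ch in cell):
                non_ascii.append(cell)
            if SCI_NOTATION.search(cell):
                scientific.append(cell)
    return {"non_ascii": non_ascii, "scientific": scientific}
```

To run it on a file, read the file with `open(path, encoding="utf-8", newline="").read()` and pass the text in. If the GUI output lists cells under `scientific` that the CLI output does not, the characters are missing from the CSV itself and not just mangled by Excel.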
I'm pretty puzzled. Maybe try the previous tabula-java version, 1.0.4 (https://github.com/tabulapdf/tabula-java/releases/tag/v1.0.4), or even v1.0.2, which appears to be the version used in the GUI. It's possible there was a regression.
v1.0.2 worked perfectly (just like the GUI version output). v1.0.3 and v1.0.4 have the same undesirable behavior and output as v1.0.5. No idea what's causing this, but thanks for helping me find a workaround. I cannot share the PDF of the article directly, but if you want to look into this further, the article is titled "Corrosion behavior of CoxCrCuFeMnNi high-entropy alloys prepared by hot pressing sintered in 3.5% NaCl solution" and is available via ScienceDirect. The table is on page 2.
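To pin down which rows differ between releases, one option is to compare the column counts of the two CSV outputs row by row. A minimal sketch follows; `find_column_mismatches` is a hypothetical helper, and it assumes the v1.0.2 output is the known-good reference. A row with fewer columns in the newer output is a likely merged-column row.

```python
import csv
import io
from itertools import zip_longest
from typing import List, Tuple


def find_column_mismatches(
    reference_text: str, candidate_text: str
) -> List[Tuple[int, int, int]]:
    """Return (row_index, reference_columns, candidate_columns) for every
    row whose column count differs between the two CSV texts."""
    reference_rows = csv.reader(io.StringIO(reference_text))
    candidate_rows = csv.reader(io.StringIO(candidate_text))
    mismatches: List[Tuple[int, int, int]] = []
    # zip_longest pads the shorter output with empty rows, so a missing
    # row shows up as a column count of 0.
    pairs = zip_longest(reference_rows, candidate_rows, fillvalue=[])
    for index, (ref_row, cand_row) in enumerate(pairs):
        if len(ref_row) != len(cand_row):
            mismatches.append((index, len(ref_row), len(cand_row)))
    return mismatches
```

Row indices are zero-based and count the header row. This only detects structural differences; cells that have the same position but different text (such as truncated scientific notation) need a separate cell-by-cell comparison.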
Glad v1.0.2 worked. I'm surprised to see this regression. I'm going to retitle this ticket to, at least, eventually theoretically hopefully maybe make this a test case in the test suite. @jazzido curious if you have a sense if this is due to upgrading the PDFBox version? |
The GUI/Webapp version of tabula works (almost) perfectly to grab the tables I need (from a scientific article). However, I am trying to create an automated system and the command line version cannot read certain characters correctly and merges columns. These errors occur with the same table that the GUI version is handling perfectly. I am feeding the command line version the same area that the GUI version is analyzing.
This is the GUI output.
This is the command line version output.