Hi @shula. Unfortunately, this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example AL YPT in line 1068) are present in the PDF, but they sit under the text of the next cell to the left. So Tabula is correctly extracting the characters.
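To illustrate why overlapping text comes out interleaved, here is a minimal sketch. It is not Tabula's code: the helper names (`place`, `read_by_position`), the coordinates, and the overflowing string are all made up for the example. It only assumes that an extractor reads the characters of a row in order of their horizontal position.

```python
from typing import List, Tuple

# A glyph is (x position of its left edge, character). Hypothetical model.
Glyph = Tuple[float, str]


def place(text: str, start_x: float, advance: float) -> List[Glyph]:
    """Lay out text left to right, one glyph every `advance` units."""
    return [(start_x + i * advance, ch) for i, ch in enumerate(text)]


def read_by_position(glyphs: List[Glyph]) -> str:
    """Read glyphs back in order of x position, as a position-based extractor might."""
    return "".join(ch for _, ch in sorted(glyphs, key=lambda g: g[0]))


# The text that is visible in the cell (coordinates are invented).
visible = place("10 CC", start_x=100.0, advance=6.0)
# Overflow from the neighbouring cell, drawn over the same area (also invented).
overflow = place("ALYPT", start_x=103.0, advance=6.0)

merged = read_by_position(visible + overflow)
print(merged)  # 1A0L YCPCT
```

Both strings are intact in the input, but because their glyphs occupy the same horizontal range, a position-ordered read alternates between them. That is the same kind of mixing seen in the reported output.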
When two of the cells in the PDF continue beyond their cell boundaries, the next cell's content goes "crazy" (i.e. it is totally different from what is expected).
In the example sample:
I assume the PDF was generated from Excel, where it is common to see long text cut off at the border of the cell. I don't know this for sure.
(The document is RTL, i.e. right to left; therefore, the 2nd cell is the 2nd from the right.)
Command line used:
java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv
The bogus lines are the ones identified by (i.e. starting with) 1068 and 1103.
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103
In the output, I see 2 phenomena:
In the attached sample.pdf, the 3rd field of the converted text file should have been the text "10 CC".
My setup: