When cell content exceeds cell boundaries, the next cell gets messed up (examples) #538

shula · 2024-02-20T17:11:51Z

When 2 of the cells in the PDF continue beyond their cell's boundary, the next cell's content goes "crazy" (i.e. it is totally different from what is expected).

In the example sample:

I assume the PDF source is Excel, where it's common to see long text cut off at the border of the cell. I don't know for sure.

PDF file (download)
TSV output file
(The document is RTL, i.e. right to left; therefore, the 2nd cell is the 2nd from the right.)

Command line used:
java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv

The bogus lines are the ones that start with 1068 and 1103.
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103

In the output, I see 2 phenomena:

The wrong text "A10L YCPCT" should have been "10 CC".
The wrong text "E209" should have been "29", etc.
The word "EUCALYPTUS" is cut off in these lines. This makes sense, since the rest of the word is not visible in the PDF, and is therefore not a real bug.

In the attached sample.pdf, the 3rd field of the converted text file should have been the text "10 CC".

My setup:

Windows 10
java version "1.8.0_401"
tabula 1.0.5


jeremybmerrill · 2024-02-20T18:33:44Z

Hi @shula. Unfortunately, this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example, AL YPT in line 1068) are present in the PDF, but sit under the text of the next cell to the left. So Tabula is correctly extracting the characters.
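
To illustrate why hidden overflow text ends up mixed into the neighbouring cell, here is a toy model in Python. It is only a sketch and not Tabula's actual algorithm: the glyph positions, the fixed advance width, the cell borders, and the helper names `place_text` and `extract_cell` are all made up for illustration, and it uses a left-to-right layout for simplicity. It assumes an extractor that assigns each character to a cell purely by its position and then orders the characters within the cell by x coordinate.

```python
from typing import List, Tuple

# A glyph is modelled as (x position, character). Real PDFs carry much more
# information; this is a deliberately simplified, hypothetical representation.
Glyph = Tuple[float, str]


def place_text(text: str, start_x: float, advance: float) -> List[Glyph]:
    """Lay out text left to right with a fixed advance; spaces emit no glyph."""
    return [(start_x + i * advance, ch) for i, ch in enumerate(text) if ch != " "]


def extract_cell(glyphs: List[Glyph], left: float, right: float) -> str:
    """Collect every glyph whose x lies in [left, right) and order by position."""
    inside = [g for g in glyphs if left <= g[0] < right]
    return "".join(ch for _, ch in sorted(inside, key=lambda g: g[0]))


# The first cell spans x in [0, 50), the second spans [50, 100).
# The long text starts in the first cell and runs on past its border.
page = place_text("EUCALYPTUS", start_x=0, advance=10)
# The second cell has its own short text drawn on top of the overflow.
page += place_text("10 CC", start_x=55, advance=10)

first_cell = extract_cell(page, 0, 50)
second_cell = extract_cell(page, 50, 100)

print(first_cell)   # the visible part of the long word
print(second_cell)  # overflow glyphs interleaved with the cell's own text
```

In this model the second cell comes out as a mix of the overflowing letters and its own "10 CC" characters, which is the same kind of interleaving seen in the problem lines above, even though a viewer would only display "10 CC" there.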
