Wrong symbol bounding box coordinates #2024

vidiecan · 2018-10-23T13:13:14Z

I am posting it here because it might be related to #1712 and #1192. I think it is questionable what should be the solution because some might prefer the current behaviour.
Can be closed any time.

Behaviour: doing OCR using LSTM with specific model returns invalid character bounding box when calling PageIterator::BoundingBox.

Reproducibility: very likely not because of the missing model.

Input image (without the coloured lines):

Before LSTM, fake words are created that contain correct blobs based on outlines. After LSTM, RecodeBeamSearch::InitializeWord, blobs are computed here

tesseract/src/lstm/recodebeam.cpp

Line 432 in 9c2d1aa

for (int i = word_start; i < word_end; ++i) {

using x positions based on the timestep. Now, the start x position where the model started to recognise 2 (plus an estimated window) is the start of the green line in the picture. It can be argued that it starts too early and stops too soon (very roughly said like before seeing the end of 2). Moreover, the computed width span is reduced here

tesseract/src/lstm/recodebeam.cpp

Line 435 in 9c2d1aa

min_half_width = xcoords[i] - xcoords[i - 1];

Then, when constructing the symbol bounding box in PAGE_RES_IT::ReplaceCurrentWord this condition is not met

tesseract/src/ccstruct/pageres.cpp

Line 1384 in 5fdaa47

src_b_it.data()->bounding_box().x_middle() < end_x) {

because the computed end_x is too far to the left. For the record, the ends are computed at

tesseract/src/ccstruct/pageres.cpp

Line 1313 in 5fdaa47

blob_end = (blob_box.right() + blob_it.data()->bounding_box().left()) / 2;

The result is that the bbox is "unitialised" (max int values for left, -max int for right).

The code that should handle this situation is below at

tesseract/src/ccstruct/pageres.cpp

Line 1407 in 5fdaa47

for (int i = 0; i < box_word->length(); ++i) {

However, it fails too because the cblobs are further to right than the wrongly computed blob end.

The unitialised bounding box gets "fixed" when calling BoundingBox at

tesseract/src/ccmain/pageiterator.cpp

Line 313 in 5fdaa47

*left = ClipToRange(static_cast<int>(box.left()), 0, pix_width);

setting the coordinates to 0, 0, image height/width.

The text was updated successfully, but these errors were encountered:

amitdo · 2018-10-23T13:45:24Z

Do you have a fix that will work well, even in the presence of ligatures and noise?
Also, it should work for all supported langs.

vidiecan · 2018-10-23T13:46:43Z

@amitdo I do not, I would have sent a PR if I had :)

zdenop · 2019-07-01T17:03:00Z

Is this issue fixed in 4.1/current code?

stweil · 2019-07-03T05:31:46Z

@noahmetzger, do you get better character boxes for this example with your latest code?

noahmetzger · 2019-07-03T07:15:06Z

I think so. Will test this later.

StephenRUK · 2019-07-12T08:42:13Z

I would also be curious if this is fixed. Working with v4.0 there is often trouble with wider characters, which then causes an offset of all following character bounding boxes. Any news above?

zdenop · 2019-07-12T08:54:31Z

Did you try 4.1.0?

StephenRUK · 2019-07-12T08:58:29Z

I tried 4.1.0-rc1 at the time, with slightly different but not fixed results. Ok so 4.1.0 is a definite release now, great thanks! I will give the Windows installer a try in combination with the tesserocr Python package if possible.

stweil · 2019-07-16T10:20:38Z

This should be fixed by pull request #2576.

Shreeshrii · 2019-07-19T10:09:35Z

tesseract wrong.png  - -l eng  --tessdata-dir ~/tessdata_fast  --oem 1 --psm 6 makebox

1 32 15 40 50 0
2 113 16 136 51 0
9 139 15 162 51 0
3 167 15 186 51 0
. 193 16 200 21 0
0 206 15 227 51 0
0 232 15 254 51 0

tesseract -v

tesseract 5.0.0-alpha-322-g74ac
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0

Issue is fixed in master branch.

@zdenop Please close.

RicketyRick · 2019-12-12T07:38:56Z

I am not sure if the following observation is covered by this ticket. But maybe it needs to be reopened.
We had several issues with bounding box coordinates and tried makebox, hocr with the executable and even on API level. Now we tested the often used "thousand Billion" image on the latest 5.0.0-alpha version on windows to check if it was fixed but again got the same starting coordinates for some letters:

   <span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span>
   <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>

We have this issue on all of our example files and on all 4.* versions of tesseract, too.

Thanks for all you work and have a great christmas time!

zdenop added bug accuracy labels Oct 24, 2018

FrkBo mentioned this issue Oct 31, 2018

Noise characters recognized with bbox as the entire page #1192

Open

lorenzob mentioned this issue Feb 14, 2019

Wrong bounding box coordinates reported (regression from 4.0.0) #2240

Closed

zdenop mentioned this issue Mar 6, 2019

Incorrect bounding boxes #2264

Closed

StephenRUK mentioned this issue Apr 16, 2019

More recent Tesseract build simonflueckiger/tesserocr-windows_build#7

Closed

Shreeshrii mentioned this issue Jul 16, 2019

Wrong coordinates on character level #2521

Closed

stweil mentioned this issue Jul 16, 2019

Implemented improved character bounding box algorithm #2576

Merged

zdenop closed this as completed Jul 19, 2019

RicketyRick mentioned this issue Dec 18, 2019

Overlapping Character Boundingboxes #2825

Open

amitdo added the bounding box label Mar 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong symbol bounding box coordinates #2024

Wrong symbol bounding box coordinates #2024

vidiecan commented Oct 23, 2018

amitdo commented Oct 23, 2018 •

edited

Loading

vidiecan commented Oct 23, 2018

zdenop commented Jul 1, 2019

stweil commented Jul 3, 2019

noahmetzger commented Jul 3, 2019

StephenRUK commented Jul 12, 2019

zdenop commented Jul 12, 2019

StephenRUK commented Jul 12, 2019

stweil commented Jul 16, 2019

Shreeshrii commented Jul 19, 2019

RicketyRick commented Dec 12, 2019 •

edited

Loading

Wrong symbol bounding box coordinates #2024

Wrong symbol bounding box coordinates #2024

Comments

vidiecan commented Oct 23, 2018

amitdo commented Oct 23, 2018 • edited Loading

vidiecan commented Oct 23, 2018

zdenop commented Jul 1, 2019

stweil commented Jul 3, 2019

noahmetzger commented Jul 3, 2019

StephenRUK commented Jul 12, 2019

zdenop commented Jul 12, 2019

StephenRUK commented Jul 12, 2019

stweil commented Jul 16, 2019

Shreeshrii commented Jul 19, 2019

RicketyRick commented Dec 12, 2019 • edited Loading

amitdo commented Oct 23, 2018 •

edited

Loading

RicketyRick commented Dec 12, 2019 •

edited

Loading