Allow for text angle/gradient to be retrieved #4070

Balearica · 2023-05-09T03:54:08Z

Tesseract already calculates the average gradient (angle) of text lines within Textord::TextordPage. This gradient is useful information, as Tesseract performs poorly when the gradient is not [almost] zero. However, the value currently exists only inside Textord::TextordPage, with no way for users to access it. Through the API, getting the gradient currently requires running Recognize or AnalyseLayout and then manually recalculating it from the results.

This PR allows users to retrieve the existing average gradient value calculated in Textord::TextordPage directly, using a function named GetGradient. This function can be called any time after FindLines has been run. ~~I also made FindLines public so it can be run directly without running Recognize or AnalyseLayout first (running AnalyseLayout would result in paragraph recognition being run twice).~~

I've already used this branch to implement an auto-rotate feature in the latest version of Tesseract.js that (unlike adding an auto-rotate pre-processing step) does not negatively impact performance for images without problematic rotation. A basic script using GetGradient is below for demonstrative purposes, along with a test image (named rotate_image.png in the code). Resolves #3836.

#include <cstdio>
#include <cstdlib>

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("rotate_image.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read rotate_image.png.\n");
        api->End();
        delete api;
        return 1;
    }
    api->SetImage(image);

    // Run layout analysis (which finds the text lines), then get the average gradient
    tesseract::PageIterator *layout = api->AnalyseLayout();
    float gradient = api->GetGradient();

    printf("Average Gradient: %f\n", gradient);

    // Destroy used objects and release memory
    delete layout;  // the iterator returned by AnalyseLayout is owned by the caller
    api->End();
    delete api;
    pixDestroy(&image);

    return 0;
}

Average Gradient: 0.085301
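
For readers who want an angle rather than a slope, the value above can be converted to degrees. The sketch below assumes the gradient is a rise-over-run slope (the tangent of the text angle), which this thread does not state explicitly; `GradientToDegrees` is a hypothetical helper, not part of the Tesseract API.

```cpp
#include <cmath>

// Hypothetical helper (not part of Tesseract): converts a text-line gradient,
// assumed to be a rise-over-run slope, into an angle in degrees.
double GradientToDegrees(double gradient) {
    const double kPi = std::acos(-1.0);
    return std::atan(gradient) * 180.0 / kPi;
}
```

Under that assumption, the 0.085301 printed above corresponds to roughly 4.9 degrees.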

amitdo · 2023-05-09T16:01:48Z

> I also made FindLines public so it can be run directly without running Recognize or AnalyseLayout first (running AnalyseLayout would result in paragraph recognition being run twice).

So paragraph detection will run twice if AnalyseLayout is followed by Recognize. This looks like a bug.

It can be solved by using this condition:

tesseract/src/api/baseapi.cpp

Lines 903 to 907 in 424b17f

bool wait_for_text = true;
GetBoolVariable("paragraph_text_based", &wait_for_text);
if (!wait_for_text) {
  DetectParagraphs(false);
}

also in AnalyseLayout.

Then we won't need to expose FindLines.

What do you think?

zdenop · 2023-05-09T18:47:27Z

Why not use leptonica solution?
E.g.

#include <leptonica/allheaders.h>

int main() {
  PIX *pix2;
  l_float32 angle, conf;
  Pix *image = pixRead("rotate_image.png");
  pix2 = pixFindSkewAndDeskew(image, 2, &angle, &conf);
  printf("Skew angle: %7.2f degrees; %6.2f conf\n", angle, conf);
  pixWrite("fixed_rotate_image.png", pix2, IFF_PNG);

  pixDestroy(&image);
  pixDestroy(&pix2);
  return 0;
}

amitdo · 2023-05-09T19:28:16Z

> Why not use leptonica solution?

Because it is not necessary. Tesseract does it anyway.

zdenop · 2023-05-09T19:45:09Z

My understanding is that users want to fix rotation before running OCR.
This feature will require calling SetImage twice (first to get the angle and then for the corrected image). I guess my proposal will be faster and without the need to touch the Tesseract API ;-)

Balearica · 2023-05-09T19:46:46Z

> My understanding is that users want to fix rotation before running OCR. This feature will require calling SetImage twice (first to get the angle and then for the corrected image). I guess my proposal will be faster and without the need to touch the Tesseract API ;-)

First, using the text gradient number that Tesseract already calculates does not add any extra steps or runtime for images that are not flagged as having problematic text angles. While I'm sure that adding additional pre-processing steps is a viable solution in many contexts (e.g. processing scanned documents), when building applications where speed is a very high priority, sending all input images through an extra step is sub-optimal. My use case here is maintaining Tesseract.js, which is primarily used in web applications rather than document processing.

Second, Leptonica uses a different methodology from Tesseract for calculating the angle of the page, so using Leptonica's algorithm adds another point of failure. While both Tesseract and Leptonica sometimes calculate the angle incorrectly, if Tesseract calculates the angle incorrectly the OCR results were almost certainly going to be bad anyway (as this calculation occurs during the line detection step). Using the angle Tesseract calculates is inherently low-risk in that regard. On the other hand, as Leptonica uses a different algorithm, it can calculate the text gradient incorrectly in a way that harms images that would otherwise produce high-quality results. When testing both solutions with sample documents, I found that the implementation using the angle calculated by Tesseract produced better results.

Overall, while an individual user may decide that using a separate auto-rotate script is better for their workflow, I think that the angle calculated by Tesseract is useful information and do not believe there's any reason it should not be accessible to the user.
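
The control flow argued for above (measure first, rotate only when the text is flagged) can be sketched as follows. This is an illustration only, not Tesseract or Tesseract.js code: `AutoRotateIfNeeded`, `AutoRotateResult`, both callbacks, and the threshold parameter are hypothetical names, and the gradient is assumed to be a rise-over-run slope.

```cpp
#include <cmath>
#include <functional>

// Outcome of one pass of the hypothetical auto-rotate flow.
struct AutoRotateResult {
    bool rotated;          // true if the image was rotated and re-submitted
    double angle_degrees;  // angle derived from the measured gradient
};

// Hypothetical driver, not part of any library.
// measure_gradient: stands in for AnalyseLayout() followed by GetGradient().
// rotate_and_reset: stands in for rotating the image by the given angle
//                   (sign convention left to the caller) and calling
//                   SetImage again with the corrected image.
AutoRotateResult AutoRotateIfNeeded(
    const std::function<double()>& measure_gradient,
    const std::function<void(double)>& rotate_and_reset,
    double threshold_degrees) {
    const double kPi = std::acos(-1.0);
    // Assumption: the gradient is a slope, so atan() yields the angle.
    const double angle = std::atan(measure_gradient()) * 180.0 / kPi;
    if (std::fabs(angle) <= threshold_degrees) {
        return {false, angle};  // common case: no extra work is done
    }
    rotate_and_reset(angle);
    return {true, angle};
}
```

Images whose measured angle stays within the threshold never reach the rotation callback, which is the property claimed above: inputs without problematic rotation pay no extra cost.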

Balearica · 2023-05-10T00:32:03Z

> > I also made FindLines public so it can be run directly without running Recognize or AnalyseLayout first (running AnalyseLayout would result in paragraph recognition being run twice).
>
> So paragraph detection will run twice if AnalyseLayout is followed by Recognize. This looks like a bug.
>
> It can be solved by using this condition:
>
> tesseract/src/api/baseapi.cpp
>
> Lines 903 to 907 in 424b17f
>
> bool wait_for_text = true;
> GetBoolVariable("paragraph_text_based", &wait_for_text);
> if (!wait_for_text) {
>   DetectParagraphs(false);
> }
>
> also in AnalyseLayout.
>
> Then we won't need to expose FindLines.
>
> What do you think?

I think this makes sense conceptually. However, if I understand correctly, such a change would impact the results returned by the AnalyseLayout API function when run using default settings (paragraph_text_based is true by default). I'm always hesitant to advocate for any change that could impact existing code. If there is opposition to making FindLines public, I think it would be fine to leave FindLines protected and use AnalyseLayout instead, even if paragraph detection runs multiple times. I timed these functions using the sample image I posted above: FindLines took ~75ms to run while DetectParagraphs took ~2ms, so the overall impact of this inefficiency appears to be fairly minor.

amitdo · 2023-05-10T08:25:13Z

You are right that my suggestion changes the current behavior, so it's not a good idea.

I still think we should not expose FindLines just to make this use case work.

My new suggestion is to choose one of these options:

1. Take my previous suggestion, but instead of reusing paragraph_text_based, add a new user configurable variable and use it inside AnalyseLayout.
2. Use AnalyseLayout without changes. As you said, speed wise, the impact of the two calls to DetectParagraphs is quite small.
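
To make the trade-off between the two options concrete, here is a toy model of the first one. Everything in it is a hypothetical stand-in: `ToyApi` and the `layout_detects_paragraphs` variable do not exist in Tesseract, and `Recognize` is reduced to the single paragraph-detection call that matters for this discussion.

```cpp
// Toy model only: these types are hypothetical stand-ins, not Tesseract code.
struct ToyApi {
    bool layout_detects_paragraphs = true;  // hypothetical new variable (option 1)
    int find_lines_calls = 0;
    int detect_paragraphs_calls = 0;

    void FindLines() { ++find_lines_calls; }
    void DetectParagraphs() { ++detect_paragraphs_calls; }

    // Layout analysis; paragraph detection is skipped when the flag is off.
    void AnalyseLayout() {
        FindLines();
        if (layout_detects_paragraphs) {
            DetectParagraphs();
        }
    }

    // Simplified: real recognition does far more; here it only counts the
    // paragraph-detection pass.
    void Recognize() { DetectParagraphs(); }
};
```

With the flag left at its default, calling AnalyseLayout and then Recognize in this model counts two paragraph-detection passes (the behaviour of option 2); turning the flag off brings that down to one without changing what existing callers see.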

Apart from this small issue, I like the new feature.

Balearica · 2023-05-10T18:42:26Z

@amitdo I changed FindLines back to a protected function. I'm fine with running AnalyseLayout as written.

amitdo · 2023-05-10T19:28:26Z

@zdenop,

After reading @Balearica's answer to your question, do you object to merging this PR?

@stweil, can we merge it?

zdenop · 2023-05-10T20:31:07Z

First of all: changing/extending C++-API should be reflected in C-API too.

Next: playing with the public API has an impact on symver, which has an impact on including new versions in major Linux distributions. This should be carefully planned.

Personally (i.e. the following is not a showstopper), I prefer that image-related operations be handled by Leptonica. Maybe I am missing something, so information on how the gradient is planned to be used would help make it clear to me ;-) (e.g. to measure speed/performance).
Also, I would like to get an example image where Tesseract provides a better calculation of the text gradient than Leptonica, so Dan can have a look at it...

Balearica · 2023-05-10T21:44:57Z

> First of all: changing/extending C++-API should be reflected in C-API too.

Good point, I edited the C API to reflect this change.

amitdo · 2023-05-12T15:20:03Z

Regarding semver, the current revision of this PR only adds one method to the public API, so the next version should be 5.4.0.

wvanrensburg · 2024-04-03T19:15:58Z

Anyone know when this will get merged? Almost a year now

stweil

Thank you!

Signed-off-by: Stefan Weil <[email protected]>

Balearica added 2 commits May 8, 2023 17:43

Added GetGradient function

a82b82f

Updated gradient_

df967ab

Balearica mentioned this pull request May 9, 2023

Detect text rotation without running recognition #3836

Closed

egorpugin approved these changes May 9, 2023

Changed FindLines back to protected function

2a93932

amitdo approved these changes May 10, 2023

stweil requested changes May 10, 2023

src/ccmain/tesseractclass.h

Balearica added 3 commits May 10, 2023 14:02

Minor update

e0e1644

Minor update

9452c5a

Added GetGradient functions to C API

b454c33

zdenop added this to the 5.4.0 milestone May 12, 2023

GerHobbelt added a commit to GerHobbelt/tesseract that referenced this pull request Aug 11, 2023

adjust code to match tesseract-ocr#4070: export orientation.

b9a1070

GerHobbelt added a commit to GerHobbelt/tesseract that referenced this pull request Aug 11, 2023

added example code from tesseract-ocr#4070 : GetGradient()

593aa26

Balearica mentioned this pull request Dec 14, 2023

Orientation detection "asymmetrical" #4116

Open

egorpugin approved these changes Apr 3, 2024

zdenop requested a review from stweil May 12, 2024 13:16

stweil approved these changes May 12, 2024

stweil merged commit c23792b into tesseract-ocr:main May 12, 2024
8 checks passed

zdenop referenced this pull request May 12, 2024

Create new release 5.3.5-rc1

cab5658

Signed-off-by: Stefan Weil <[email protected]>
