
About the OCR engine you use: three questions need your help #9

Open · hanquansanren opened this issue Jun 13, 2022 · 19 comments

@hanquansanren commented Jun 13, 2022

[screenshot of Section 5.1 from the paper]
Q1: Hello, in Section 5.1 of your paper, I notice you used Pytesseract v3.02.02, as shown in the screenshot above.
But on the pytesseract homepage I can only find versions 0.3.x and 0.2.x; could you please tell me the exact version you used? By the way, the DewarpNet paper specifies pytesseract version 0.2.9. Do different OCR engine versions cause large differences?

Q2: Calculating the CER metric requires the ground truth for each character in the images. I notice your repository provides an index of 60 images for the OCR test, while DewarpNet provides an index of 25 images together with the ground truth in JSON form. Can you tell me how you annotated the ground truth? And if possible, can you share your ground-truth file?

In addition, I noticed that several of the 25 ground truths in DewarpNet contain label errors, so I guess they were also produced with an OCR engine. If you also used an OCR engine to label the ground truth, could you share more details about how you annotated it?

Q3: In fact, I also tried to test the OCR performance on your model's output. However, neither pytesseract 0.3.x nor 0.2.x reproduces the result in the paper.
Here is my OCR test code:

from PIL import Image
import pytesseract

import json
import os
from os.path import join as pjoin
from pathlib import Path
import numpy as np


def edit_distance(str1, str2):
    """Compute the edit (Levenshtein) distance between two strings.
    Args:
        str1: the first string.
        str2: the second string.
    Returns:
        dist: the edit distance.
    """
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                d = 0
            else:
                d = 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)
    dist = matrix[len(str1)][len(str2)]
    return dist



def get_cer(src, trg):
    """Character error rate (CER) of editing the source string src into the target string trg.
    Args:
        src: the source (recognized) string.
        trg: the target (reference) string.
    Returns:
        cer: the character error rate.
    """
    dist = edit_distance(src, trg)
    cer = dist / len(trg)
    return cer

if __name__ == "__main__":
    reference_list = []
    reference_index = []
    cer_list = []
    r_path = pjoin('./doctr/')
    result_file = open('result1.log', 'w')
    print(pytesseract.get_languages(config=''))
    # Load the reference strings once, keyed by image name.
    with open('tess_gt.json', 'r') as f:
        gt_dict = json.load(f)
    with open('ocr_files.txt', 'r') as fr:
        for l, line in enumerate(fr):
            reference_list.append(line)
            reference_index.append(l)
            print(len(line), line)
            print(len(line), line, file=result_file)
            h1str = "./doctr/" + line[7:-1] + "_1 copy.png"
            h2str = "./doctr/" + line[7:-1] + "_2 copy.png"
            print(h1str, h2str)
            h1 = pytesseract.image_to_string(Image.open(h1str), lang='eng')
            h2 = pytesseract.image_to_string(Image.open(h2str), lang='eng')

            r = gt_dict.get(line[:-1])
            cer_value1 = get_cer(h1, r)
            cer_value2 = get_cer(h2, r)
            print(cer_value1, cer_value2)
            print(cer_value1, cer_value2, file=result_file)
            cer_list.append(cer_value1)
            cer_list.append(cer_value2)

    print(np.mean(cer_list))
    print(np.mean(cer_list), file=result_file)
    result_file.close()

In brief, the core OCR call is h1 = pytesseract.image_to_string(Image.open(h1str), lang='eng'), but with it I only get a CER of 0.6. This is far from the 0.2~0.3 CER reported for previous models.
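As a quick sanity check of the metric helpers above (hypothetical strings, just to confirm the edit distance and CER behave as expected):

    # Hypothetical strings: one substitution ('i' for 'l') in the recognized text.
    src = "hello worid"   # recognized string
    trg = "hello world"   # reference string
    assert edit_distance(src, trg) == 1
    print(get_cer(src, trg))  # 1 / 11, roughly 0.091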

Could you share your OCR version and code for the OCR metric? Many thanks for your generous response!

@fh2019ustc (Owner) commented Jun 16, 2022

Thanks for your interest, and sorry for the late reply.
I am sorry that the OCR environment for DocTr is missing.
However, you can follow the settings of our new work, DocScanner.
Specifically, the pytesseract version is 0.3.8, and the Tesseract version is the recent 5.0.1.20220118.
We follow the OCR evaluation settings of DewarpNet and DocTr, which use 50 and 60 document images of the DocUNet Benchmark dataset, respectively.
The results are shown in Table 2.
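To confirm that an environment matches these versions, both can be printed (a minimal sketch; importlib.metadata is just one way to read the installed package version):

    import pytesseract
    from importlib.metadata import version

    print(pytesseract.get_tesseract_version())  # Tesseract binary, e.g. 5.0.1.20220118
    print(version('pytesseract'))                # pytesseract package, e.g. 0.3.8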

Besides, I think it is unnecessary to annotate the GT strings manually.
If a distorted image is perfectly rectified, its recognized string should be consistent with the string recognized from the GT image.
Hence, we simply use the recognized string of the GT image as the reference string when calculating ED and CER.
Our OCR evaluation code is as follows:

import numpy as np
import pytesseract
from PIL import Image


def Levenshtein_Distance(str1, str2):
    # Dynamic-programming edit distance between two strings.
    matrix = [[i + j for j in range(len(str2) + 1)] for i in range(len(str1) + 1)]
    for i in range(1, len(str1) + 1):
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                d = 0
            else:
                d = 1
            matrix[i][j] = min(matrix[i - 1][j] + 1, matrix[i][j - 1] + 1, matrix[i - 1][j - 1] + d)

    return matrix[len(str1)][len(str2)]


def cal_cer_ed(path_ours, tail='_rec'):
    path_gt = './GT/'
    N = 66
    cer1, cer2 = [], []
    ed1, ed2 = [], []
    check = [0 for _ in range(N + 1)]
    # Image indices used by DewarpNet for OCR evaluation.
    lis = [1, 9, 10, 19, 20, 21, 22, 23, 24, 27, 30, 31, 32, 34, 35, 36, 37, 38, 39, 40, 44, 45, 46, 47, 49]
    for i in range(1, N):
        if i not in lis:
            continue
        gt = Image.open(path_gt + str(i) + '.png')
        img1 = Image.open(path_ours + str(i) + '_1' + tail)
        img2 = Image.open(path_ours + str(i) + '_2' + tail)
        # The string recognized from the GT image serves as the reference.
        content_gt = pytesseract.image_to_string(gt)
        content1 = pytesseract.image_to_string(img1)
        content2 = pytesseract.image_to_string(img2)
        l1 = Levenshtein_Distance(content_gt, content1)
        l2 = Levenshtein_Distance(content_gt, content2)
        ed1.append(l1)
        ed2.append(l2)
        cer1.append(l1 / len(content_gt))
        cer2.append(l2 / len(content_gt))
        check[i] = cer1[-1]
    print('CER: ', (np.mean(cer1) + np.mean(cer2)) / 2.)
    print('ED:  ', (np.mean(ed1) + np.mean(ed2)) / 2.)


def evalu(path_ours, tail):
    cal_cer_ed(path_ours, tail)
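Called, for example, as follows (the rectified-image directory and the filename tail here are assumptions based on the naming used elsewhere in this thread):

    evalu('./rectified/', ' copy_rec.png')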

Hope this helps~!

@hanquansanren (Author)

Thanks a lot for your detailed explanation. With your code, Tesseract version, and pytesseract version, I have reproduced the CER performance in the paper.

DocScanner is another great work that achieves the best MS-SSIM; I will spend some time following it next.

@fh2019ustc (Owner)

@hanquansanren Thanks for your feedback.

@an1018 commented Oct 26, 2022

@fh2019ustc I've installed the corresponding versions, but I get a different ED value (607), while the CER value (0.20) is the same as in Table 2.

Eval dataset: DocUNet
gt: scan images
pred: crop images

@fh2019ustc (Owner)

@an1018 Hi, please use the OCR evaluation code in our repo, in which we have updated the image list used in DewarpNet.
Then you can obtain the following performance:
[screenshot of the expected CER/ED results]

@fh2019ustc (Owner) commented Oct 26, 2022

@an1018 For the OCR performance of other methods under the two settings (DocTr and DewarpNet), you can refer to DocScanner.

@fh2019ustc (Owner)

@an1018 Hope to get your reply.

@an1018 commented Oct 26, 2022

@fh2019ustc Yes, I use OCR_eval.py for evaluation, but there are still some problems:
Q1: Why is the performance different from the performance reported in the DocTr paper?
[screenshots comparing the repo results with the paper's table]

Q2: And is the performance of DocTr in the following table based on the geometric rectification results of GeoTr, rather than on the illumination correction of IllTr?
[screenshot of the results table]

Q3: I still can't get the same performance using the rectified images from Baidu Cloud:

python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'Rectified_DocUNet_DocTr/' --tail ' copy_rec.png'

note: 'docunet/scan/' contains the scan images of DocUNet

Q4: How can I get the same result without using the rectified images from Baidu Cloud?

python inference.py --distorrted_path 'docunet/crop/' --gsave_path './geo_rec' --isave_path './ill_rec/' --ill_rec True
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'ill_rec/' --tail ' copy_ill.png'

@fh2019ustc (Owner) commented Oct 26, 2022

@an1018 Note that in the DocUNet Benchmark, the '64_1.png' and '64_2.png' distorted images are rotated by 180 degrees, so they do not match the GT documents. This is overlooked by most existing works, so please check your copies before evaluation.
We found this dataset error in April of this year while preparing the major revision of our PAMI submission, DocScanner, whereas DocTr was accepted in June 2021; that is why we updated the performance numbers in our repo.
Since the error is overlooked by most works in this field, in our PAMI submission DocScanner and our ECCV 2022 paper DocGeoNet we updated the performance of all previous methods.
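A minimal sketch of such a check (the directory and filenames here are assumptions; rotating by 180 degrees again restores the original orientation):

    from PIL import Image

    # Hypothetical paths: undo the 180-degree rotation of the two
    # mislabeled distorted images before running the evaluation.
    for name in ['64_1 copy.png', '64_2 copy.png']:
        img = Image.open('./crop/' + name)
        img.rotate(180).save('./crop/' + name)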

@fh2019ustc (Owner)

@an1018 For your Q2: yes, this performance is based on GeoTr.

@fh2019ustc (Owner) commented Oct 26, 2022

@an1018 For Q3 and Q4: to reproduce the above performance, please use the geometrically rectified images rather than the illumination-corrected images.
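Concretely, that would mean pointing the evaluation at the geometric outputs; a sketch reusing only the flags shown above (paths and tail are assumptions, and it assumes --ill_rec defaults to off):

python inference.py --distorrted_path 'docunet/crop/' --gsave_path './geo_rec/'
python OCR_eval.py --path_gt 'docunet/scan/' --path_ours 'geo_rec/' --tail ' copy_rec.png'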

@an1018 commented Oct 26, 2022

@fh2019ustc Thanks for your quick response. I'll try again and give you feedback.

fh2019ustc reopened this Oct 26, 2022
@an1018 commented Oct 27, 2022

@fh2019ustc Hi, I've installed Tesseract (v5.0.1) from Git and downloaded the eng model. My performance is close to the following numbers, but there are still some differences. What else could be causing them?

CER: 0.1759
ED: 470.33

Here are some of my configurations:
1) Images:
gt images: the scan images of DocUNet
pred images: the rectified images from the Baidu Cloud link in your repo
2) Tesseract version: [screenshot]
3) eng model: [screenshot]

@fh2019ustc (Owner)

[screenshots of the Tesseract and pytesseract version information]
This is the version information for your reference.
Besides, what is your performance under Setting 2?

@an1018 commented Oct 27, 2022

1) How can I install 5.0.1.20220118 rather than 5.0.1? (My environment is Ubuntu Linux.)
2) The performance under Setting 2:
ED: 733.58
CER: 0.1859

@fh2019ustc (Owner) commented Oct 27, 2022

Hi, this is the link for Windows; our environment is Windows. Hope to get your reply.
This is the link for Ubuntu, but we have not tried it.

@an1018 commented Oct 28, 2022

Oh, I can get the same performance in a Windows environment. But for Ubuntu, I can't find Tesseract v5.0.1.20220118.

@fh2019ustc (Owner)

@an1018 Thanks for your reply. For the OCR evaluation, I think you can compare performance within a single environment, whether it is Windows or Ubuntu.

@an1018 commented Oct 28, 2022

Yes, thanks for your continuous technical support.
