The extracted table box coordinates do not correspond to the images converted from the PDF #486

SWHL · 2022-06-15T13:22:42Z

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug

The extracted table box coordinates do not correspond to the images converted from the PDF.

Environment

OS: CentOS 7
Python: 3.7.11
camelot-py: 0.10.1

Reproduction

Run the following code: (foo.pdf)

import camelot
import copy
import cv2


def draw_bbox(img, start_point, end_point, ratio=1):
    start_point = tuple(map(lambda x: round(x * ratio), start_point))
    end_point = tuple(map(lambda x: round(x * ratio), end_point))
    cv2.rectangle(img, start_point, end_point, (0, 255, 0), 2)


pdf_path = 'foo.pdf'
tables = camelot.read_pdf(pdf_path, flavor='lattice', backend="poppler")
table = tables[0]

table_x0, table_y0, table_x1, table_y1 = table._bbox
img = table._image[0]

tmp_img = copy.deepcopy(img)
draw_bbox(tmp_img,
          start_point=(table_x0, table_y0),
          end_point=(table_x1, table_y1),
          ratio=1)
cv2.imwrite('foo.jpg', tmp_img)

In the output image, the green rectangle marks where the extracted table coordinates land on the image extracted from the PDF, while the red rectangle at the bottom of the image marks the correct, desired box.

Bug fix

import camelot
import copy
import cv2


def draw_bbox(img, start_point, end_point, ratio=1):
    start_point = tuple(map(lambda x: round(x * ratio), start_point))
    end_point = tuple(map(lambda x: round(x * ratio), end_point))
    cv2.rectangle(img, start_point, end_point, (0, 255, 0), 2)


pdf_path = 'foo.pdf'
tables = camelot.read_pdf(pdf_path, flavor='lattice', backend="poppler")
table = tables[0]

table_x0, table_y0, table_x1, table_y1 = table._bbox
img = table._image[0]

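# The page image is rendered at 300 DPI by default, while the bbox is in
# PDF points (72 per inch), so pixels = points * 300 / 72.
ratio = 300 / 72
new_tmp_img = copy.deepcopy(img)
# Page height in PDF points, recovered from the image height in pixels.
pdf_height = img.shape[0] / ratio
# PDF y grows upward from the bottom edge; image y grows downward from the
# top edge, so flip y before scaling.
draw_bbox(new_tmp_img,
          start_point=(table_x0, pdf_height - table_y0),
          end_point=(table_x1, pdf_height - table_y1),
          ratio=ratio)
cv2.imwrite('foo_right.jpg', new_tmp_img)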


LxYuan0420 · 2022-07-12T04:14:53Z

Curious to know how you got this exact value of ratio = 300 / 72. Does it work for other PDFs?

SWHL · 2022-07-12T06:51:20Z

Answer to question 1:

camelot obtains the box coordinates through the pdfminer package, whose default resolution is 72 (I forgot where I saw it), whereas it obtains the image through the read_pdf function, whose default resolution is 300. That is where the ratio 300 / 72 comes from.

camelot/camelot/io.py

Line 93 in cd8ac79

resolution* : int, optional (default: 300)
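The relationship described above can be sketched as a small helper. This is a minimal sketch, assuming the page image was rendered at 300 DPI, the bbox is in PDF points (72 per inch), and PDF space has its origin at the bottom-left; `pdf_bbox_to_image` is a hypothetical name, not part of camelot.

```python
def pdf_bbox_to_image(bbox, image_height_px, dpi=300, points_per_inch=72):
    """Map a PDF-space bbox (points, origin bottom-left) to pixel
    coordinates (origin top-left). Hypothetical helper, not a camelot API."""
    x0, y0, x1, y1 = bbox
    # Scale points to pixels: pixels = points * dpi / 72.
    xs = sorted((x0 * dpi / points_per_inch, x1 * dpi / points_per_inch))
    ys = sorted((y0 * dpi / points_per_inch, y1 * dpi / points_per_inch))
    left, right = round(xs[0]), round(xs[1])
    # Flip y: the highest PDF y becomes the smallest image y.
    top = round(image_height_px - ys[1])
    bottom = round(image_height_px - ys[0])
    return left, top, right, bottom


# A bbox of (72.0, 295.2, 563.04, 648.72) on a 3300 px tall page image:
print(pdf_bbox_to_image((72.0, 295.2, 563.04, 648.72), 3300))
# (300, 597, 2346, 2070)
```

If a different resolution is passed to read_pdf, the same value would have to be used for `dpi` here.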

Answer to question 2:

You can try it on other PDFs.

baleris · 2023-05-17T13:02:01Z

@SWHL This really helped me understand the conversion. However, I have a similar problem: I have the coordinates of an object obtained from a page image (the PDF page was converted into a page image), and I want to convert these coordinates into camelot's PDF-level coordinates. I tried to follow the logic above in reverse order, but without success.
I am new to this. Can anyone give some hints or logic for converting page image coordinates to PDF-level coordinates?
I have the object coordinates x0, y0, x1, y1 (from the page image) and the page image width and height. I also have the target PDF height and width.
Ex: (x0,y0,x1,y1) = 188, 393, 1576, 1498
pageImage height,width = (3300, 2550)
pdf height,width = (792, 612)

SWHL · 2023-05-18T00:44:18Z

@baleris You can try this:

$$\frac{2550}{612} = \frac{188}{x} \rightarrow x?$$

$$\frac{3300}{792} = \frac{393}{y} \rightarrow y?$$
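The two proportions above give the scale factors. If the target is PDF space with a bottom-left origin, as in the bug fix earlier in this thread, the y values also need to be flipped. A minimal sketch under that assumption follows; `image_bbox_to_pdf` is a hypothetical name, not part of camelot, and the result is only the geometric conversion of the given box, not necessarily the box camelot would detect.

```python
def image_bbox_to_pdf(bbox_px, image_size, pdf_size):
    """Map a pixel bbox (origin top-left) to PDF points (origin bottom-left).

    image_size and pdf_size are (width, height) pairs.
    Hypothetical helper, not a camelot API.
    """
    img_w, img_h = image_size
    pdf_w, pdf_h = pdf_size
    # Scale factors from the proportions: 612 / 2550 and 792 / 3300.
    sx = pdf_w / img_w
    sy = pdf_h / img_h
    x0, y0, x1, y1 = bbox_px
    left, right = sorted((x0 * sx, x1 * sx))
    # Image y grows downward, PDF y grows upward, so subtract from the height.
    bottom, top = sorted((pdf_h - y0 * sy, pdf_h - y1 * sy))
    return left, bottom, right, top


# Values from the question: image is 2550 x 3300 px, PDF page is 612 x 792 pt.
print(image_bbox_to_pdf((188, 393, 1576, 1498), (2550, 3300), (612, 792)))
```

For these inputs the result is approximately (45.12, 432.48, 378.24, 697.68), returned as (x0, y_bottom, x1, y_top).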

baleris · 2023-05-18T05:17:45Z

@SWHL This did not work. When I checked the table coordinates detected by camelot, they were totally different. For example, for the coordinates mentioned above, camelot's corresponding coordinates are (72.0, 295.2, 563.04, 648.72).

baleris · 2023-05-29T09:55:18Z

@SWHL I see that in your solution above you get the page image from img = table._image[0]. If I have a borderless table and want to pass flavor='stream', i.e. camelot.read_pdf(src, flavor='stream'), how can I get the image in this case? If I try table._image[0] as above, I get an error message.

Any suggestions for getting the image with the "stream" flavor (borderless tables)?

SWHL · 2023-05-30T05:49:55Z

You can refer to this:

camelot/tests/test_common.py

Lines 35 to 40 in cd8ac79

def test_stream():
    df = pd.DataFrame(data_stream)
    filename = os.path.join(testdir, "health.pdf")
    tables = camelot.read_pdf(filename, flavor="stream")
    assert df.equals(tables[0].df)

This question is beyond the scope of this issue. I suggest opening a new issue to discuss it.

baleris · 2023-05-30T11:05:36Z

@SWHL As suggested, I have raised a new issue: #497

baleris mentioned this issue May 30, 2023

extracted table cell coordinates(stream) do not corresponds to page image converted from pdf #497

Open

Siddharth1India mentioned this issue Jun 13, 2023

Camelot image co-ordinates to PDF box camelot-dev/camelot#377

Open
