'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932

remi-braun · 2024-11-04T15:56:33Z

Extracting text used to extract all words, now at least one is missing from the bounding box

Environment

Both Linux and Windows.
v5.0.1 has been tested and is fine.

Code + PDF

With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf

from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
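    """
    Extract the text located inside a window of the given PDF page.

    The window bounds are fractions (0 to 1) of the cropbox's right and top coordinates.

    Args:
        page (PageObject): PDF page
        x_min (float): Lower x bound, as a fraction of page.cropbox.right
        x_max (float): Upper x bound, as a fraction of page.cropbox.right
        y_min (float): Lower y bound, as a fraction of page.cropbox.top
        y_max (float): Upper y bound, as a fraction of page.cropbox.top
        sep (str): Separator used to join the extracted text parts

    Returns:
        str: Extracted text

    """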
    parts = []

    def visitor_right(text, cm, tm, font_dict, font_size):
        x = tm[4]
        y = tm[5]
        in_window = (
            float(x_max * float(page.cropbox.right))
            > x
            > float(x_min * float(page.cropbox.right))
        ) and (
            float(y_max * float(page.cropbox.top))
            > y
            > float(y_min * float(page.cropbox.top))
        )
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt

Running this snippet:

extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")

With pypdf v5.1.0, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

With pypdf v5.0.1, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

The "Road" word is missing. After some checks, I see in the new version that x, y for Road is set to 0, 0 which is really weird.
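
One way to list the affected fragments is to record every piece of text whose text matrix translation is exactly (0, 0). The sketch below is only a diagnostic aid and not part of pypdf: `make_origin_collector` is a hypothetical helper name, and the visitor takes the same arguments as `visitor_right` above. Text that is legitimately placed at the page origin would also be reported, so the result still needs a manual check.

```python
def make_origin_collector():
    """Return (visitor, hits); hits collects fragments reported at x=0, y=0."""
    hits = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] hold the x/y translation of the text matrix.
        # Whitespace-only fragments are ignored to keep the report short.
        if text.strip() and tm[4] == 0 and tm[5] == 0:
            hits.append((text, list(cm), list(tm)))

    return visitor, hits


# Usage (assumes `page` is a pypdf PageObject):
# visitor, hits = make_origin_collector()
# page.extract_text(orientations=0, visitor_text=visitor)
# for text, cm, tm in hits:
#     print(repr(text), cm, tm)
```

Running it against both pypdf versions and comparing the two `hits` lists should show which fragments changed position between releases.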


stefan6419846 · 2024-11-04T16:18:05Z

Thanks for the report. We had some changes to the text extraction code for version 5.1.0, although I am not really sure why it would affect these positions.

remi-braun · 2025-02-04T17:22:09Z

Hi,

Just to let you know, this is still happening with 5.2.0.
For example, in this PDF (EMSR698_AOI01_DEL_MONIT01_149000_map_v1.pdf), the line "On the 7 October 2023 at 12:00 UTC, extensive flooding is forecast:" is positioned at 0.0, 0.0.

This breaks my file validation (I'm looking for lines that are correctly spelled and correctly placed in the PDF) by breaking the text visitor.

This wasn't happening with 5.1.0

stefan6419846 · 2025-02-04T17:55:46Z

Is this a typo or do you want to indicate that version 5.2.0 changed/broke more stuff than 5.1.0?

Apart from this: With pypdf being a FOSS project, I will happily accept PRs which fix this behavior. I will not have any resources for digging into this myself in the near future, thus anyone who wants to look into this (pinpointing the relevant change, identifying why it broke it, preparing a fix) is highly appreciated as usual.

remi-braun · 2025-02-04T18:08:13Z

Sadly, I think that 5.2.0 broke more stuff than 5.1.0 😢

Honestly, I would love to help as I often do, but I just don't understand the whereabouts of PDF parsing :(
I just can say that for some text, the TM matrix seems broken for some reason.
In the linked PDF, all TM matrix lines after the orange titles are for example broken (x, y = 0, 0)


stefan6419846 · 2025-02-04T18:19:34Z

> Honestly, I would love to help as I often do, but I just don't understand the whereabouts of PDF parsing :(

While it does not solve the actual issue, getting to know the corresponding merge commits/PRs is something which should not require knowledge about PDF internals, but might already help here. With this, further analysis ideally can concentrate on much less code.

remi-braun · 2025-02-05T07:56:02Z

I don't really understand what you mean 😅

Anyway, in the example PDF, the faulty lines are (with corresponding matrices):

Placename: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
Pre-event image: Sentinel-2A/B (2023) (acquired: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
On the 7 October 2023 at 12:00 UTC, extensive flooding is forecast: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
4.5: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
9: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
km: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

There are not that many 🤔

I will use workarounds, don't worry.

stefan6419846 · 2025-02-05T07:59:42Z

To simplify further analysis, it would be great to know which actual pypdf commit(s) led to this. Checking this should be rather straightforward and not require any deeper PDF knowledge: basically, run your example code against every commit of the corresponding release until you see the bad behavior.
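
Testing every commit one by one works, but a binary search over the commit range usually needs far fewer runs; `git bisect` automates the same idea. The sketch below shows the search itself under stated assumptions: `first_bad_commit` and `is_bad` are hypothetical names, the commits are ordered oldest to newest, the first one is known good, the last one is known bad, and the behavior flips only once. In practice `is_bad` would install pypdf at the given commit and check whether "Road" is missing from the output of `extract_map_text`.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True.

    Assumes commits[0] is good, commits[-1] is bad, and that every commit
    after the first bad one is bad as well.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]
```

The endpoints are trusted rather than tested, so it is worth confirming by hand that the oldest commit really is good and the newest really is bad before starting.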

remi-braun changed the title ~~extract_text~~ 'extract_text' text matrix seems to be sometimes broken with v5.1.0 Nov 4, 2024

stefan6419846 added the labels workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow) and is-regression (regression introduced as a side effect of another change) Nov 4, 2024
