'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932

Open
remi-braun opened this issue Nov 4, 2024 · 7 comments
Labels
is-regression: Regression introduced as a side-effect of another change
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

remi-braun commented Nov 4, 2024

Text extraction used to return all words; now at least one word inside the bounding box is missing.

Environment

Both Linux and Windows.
v5.0.1 has been tested and is fine.

Code + PDF

With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf

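from pypdf import PageObject


def extract_map_text(
    page: PageObject,
    x_min: float = 0,
    x_max: float = 1,
    y_min: float = 0,
    y_max: float = 1,
    sep=";",
):
    """
    Extract the text located inside a window of the given PDF page.

    Args:
        page (PageObject): PDF page
        x_min (float): Left bound of the window, as a fraction of page.cropbox.right
        x_max (float): Right bound of the window, as a fraction of page.cropbox.right
        y_min (float): Lower bound of the window, as a fraction of page.cropbox.top
        y_max (float): Upper bound of the window, as a fraction of page.cropbox.top
        sep (str): Separator used to join the extracted text parts

    Returns:
        str: Extracted text

    """
    parts = []

    def visitor_right(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; its last two entries hold the x and y translation
        x = tm[4]
        y = tm[5]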
        in_window = (
            float(x_max * float(page.cropbox.right))
            > x
            > float(x_min * float(page.cropbox.right))
        ) and (
            float(y_max * float(page.cropbox.top))
            > y
            > float(y_min * float(page.cropbox.top))
        )
        if in_window and text not in ["!", "", " "]:
            parts.append(text)

    page.extract_text(orientations=0, visitor_text=visitor_right)
    page_txt = (
        sep.join([p for p in parts if p not in ["\n"]])
        .replace("\n", " ")
        .replace("\x00", "")
        .replace("\xa0", " ")
    )
    return page_txt

Running this snippet:

extract_map_text(
    page, x_min=0.8, y_min=0.6, y_max=0.8, sep=" "
).replace("  ", " ")

With pypdf v5.1.0, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

With pypdf v5.0.1, the output is:

'3.5 km Potentially Affected Built-up and Transportations Built-Up 1 No. Road 0.9 km Flooded area 33.1 ha Potentially affected population ~ 200'

The word "Road" is missing. After some checks, I see that in the new version the (x, y) position for "Road" is set to (0, 0), which is really weird.
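A fragment reported at (0, 0) can never pass the window test above: the comparison is strict, so with x_min=0.8 the origin always falls outside. The following is a minimal sketch, not part of the original report, of a window check that flags such fragments instead of silently dropping them. The function name `classify_fragment` and the `right`/`top` parameters (standing in for `page.cropbox.right` and `page.cropbox.top`) are hypothetical, and it assumes that no real text sits exactly at the origin.

```python
def classify_fragment(x, y, right, top, x_min=0.0, x_max=1.0, y_min=0.0, y_max=1.0):
    """Classify a text fragment by the position taken from its text matrix.

    `right` and `top` stand in for page.cropbox.right and page.cropbox.top.
    Returns "inside", "outside" or "unplaced".
    """
    if x == 0 and y == 0:
        # Assumption: a position exactly at the origin means the matrix was
        # not filled in, so report it separately rather than discarding it.
        return "unplaced"
    inside = (x_min * right < x < x_max * right) and (y_min * top < y < y_max * top)
    return "inside" if inside else "outside"
```

Inside a visitor, fragments classified as "unplaced" could be collected in a second list and reviewed manually, so the regression shows up as a warning instead of a missing word.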

remi-braun changed the title from "extract_text" to "'extract_text' text matrix seems to be sometimes broken with v5.1.0" on Nov 4, 2024
@stefan6419846
Collaborator

Thanks for the report. We had some changes to the text extraction code for version 5.1.0, although I am not really sure why it would affect these positions.

stefan6419846 added the workflow-text-extraction and is-regression labels on Nov 4, 2024
@remi-braun
Author

remi-braun commented Feb 4, 2025

Hi,

Just to let you know, this is still happening with 5.2.0.
For example, in this PDF, EMSR698_AOI01_DEL_MONIT01_149000_map_v1.pdf, the line "On the 7 October 2023 at 12:00 UTC, extensive flooding is forecast:" is positioned at (0.0, 0.0).

This breaks my file validation (I check that lines are correctly spelled and correctly placed in the PDF) by breaking the text visitor.

This wasn't happening with 5.1.0

@stefan6419846
Collaborator

Is this a typo or do you want to indicate that version 5.2.0 changed/broke more stuff than 5.1.0?

Apart from this: With pypdf being a FOSS project, I will happily accept PRs which fix this behavior. I will not have any resources for digging into this myself in the near future, so help from anyone who wants to look into this (pinpointing the relevant change, identifying why it broke the extraction, preparing a fix) is highly appreciated as usual.

@remi-braun
Author

remi-braun commented Feb 4, 2025

Sadly, I think that 5.2.0 broke more stuff than 5.1.0 😢

Honestly, I would love to help as I often do, but I just don't understand the whereabouts of PDF parsing :(
I can only say that for some text, the TM matrix seems broken for some reason.
For example, in the linked PDF, the TM matrix of every line after the orange titles is broken (x, y = 0, 0).

[Image]

@stefan6419846
Collaborator

> Honestly, I would love to help as I often do, but I just don't understand the whereabouts of PDF parsing :(

While it does not solve the actual issue, getting to know the corresponding merge commits/PRs is something which should not require knowledge about PDF internals, but might already help here. With this, further analysis ideally can concentrate on much less code.

@remi-braun
Author

I don't really understand what you mean 😅

Anyway, in the example PDF, the faulty lines are (with corresponding matrices):

Placename: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
Pre-event image: Sentinel-2A/B (2023) (acquired: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
On the 7 October 2023 at 12:00 UTC, extensive flooding is forecast: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
4.5: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
9: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
km: x: 0.0, y: 0.0, cm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], tm: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

There are not that many 🤔

I will use workarounds, don't worry.
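A listing like the one above could be produced with a small diagnostic visitor. This is only a sketch under assumptions: the helper name `make_diagnostic_visitor` is hypothetical, and it treats a fragment as suspicious when both `cm` and `tm` equal the identity matrix, which matches the faulty lines shown here but may not cover every case.

```python
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def make_diagnostic_visitor(suspicious):
    """Build a visitor that appends suspicious text fragments to `suspicious`.

    A fragment counts as suspicious when both its current transformation
    matrix (cm) and its text matrix (tm) are the identity matrix.
    """

    def visitor(text, cm, tm, font_dict, font_size):
        stripped = text.strip()
        if not stripped:
            # Ignore whitespace-only fragments
            return
        cm_values = [float(v) for v in cm]
        tm_values = [float(v) for v in tm]
        if cm_values == IDENTITY and tm_values == IDENTITY:
            suspicious.append(stripped)

    return visitor
```

It would be passed to pypdf in the same way as the visitor in the original report, for example `page.extract_text(visitor_text=make_diagnostic_visitor(found))` with `found = []`.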

@stefan6419846
Collaborator

To simplify further analysis, it would be great to know which actual pypdf commit(s) led to this. Checking this should be rather straightforward and not require any deeper PDF knowledge: basically run your example code against every commit of the corresponding release until you see the bad behavior.
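One way to automate this search is `git bisect run` with a small check script. The sketch below is an assumption-laden illustration, not something posted in the thread: the script name, the choice of the first page, and the use of the word "Road" as the pass/fail signal are all hypothetical and taken from the first example in this issue. It relies on the `git bisect run` convention that exit code 0 marks a commit as good and exit code 1 marks it as bad.

```python
import sys

EXPECTED_WORD = "Road"  # word reported missing with v5.1.0 in the first example


def bisect_exit_code(extracted_text, expected_word=EXPECTED_WORD):
    """Return 0 (good commit) if the word was extracted, else 1 (bad commit)."""
    return 0 if expected_word in extracted_text.split() else 1


def main(pdf_path):
    # Imported lazily so the helper above can be used without pypdf installed
    from pypdf import PdfReader

    page = PdfReader(pdf_path).pages[0]  # assumption: the map is on the first page
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Keep only non-empty fragments whose position is not the origin
        if text.strip() and (tm[4], tm[5]) != (0, 0):
            parts.append(text.strip())

    page.extract_text(visitor_text=visitor)
    return bisect_exit_code(" ".join(parts))


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

With pypdf installed in editable mode from a local clone (so that each checked-out commit is the code being imported), the search could be started with `git bisect start <bad> <good>` using the repository's tags for 5.1.0 and 5.0.1, followed by `git bisect run python check_regression.py EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf`.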
