'extract_text' text matrix seems to be sometimes broken with v5.1.0 #2932
Comments
Thanks for the report. We had some changes to the text extraction code for version 5.1.0, although I am not really sure why it would affect these positions.
Hi, just to note that this is still happening with 5.2.0. It breaks the text visitor, and with it my file validation (I check that lines are correctly spelled and placed in the PDF). This wasn't happening with 5.1.0
Is this a typo or do you want to indicate that version 5.2.0 changed/broke more stuff than 5.1.0? Apart from this: With pypdf being a FOSS project, I will happily accept PRs which fix this behavior. I will not have any resources for digging into this myself in the near future, thus anyone who wants to look into this (pinpointing the relevant change, identifying why it broke it, preparing a fix) is highly appreciated as usual.
Sadly, I think that 5.2.0 broke more stuff than 5.1.0 😢 Honestly, I would love to help as I often do, but I just don't understand the ins and outs of PDF parsing :(
While it does not solve the actual issue, getting to know the corresponding merge commits/PRs is something which should not require knowledge about PDF internals, but might already help here. With this, further analysis ideally can concentrate on much less code.
I don't really understand what you mean 😅 Anyway, in the example PDF, the faulty lines are (with corresponding matrices):
They are not that many 🤔 I will use workarounds, don't worry.
To simplify further analysis, it would be great to know which actual pypdf commit(s) led to this. Checking this should be rather straightforward and should not require any deeper PDF knowledge: basically, run your example code against every commit of the corresponding release until you see the bad behavior.
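The commit-by-commit check described above could be automated with `git bisect run`. The script below is only a sketch and not the reporter's original snippet: the file name `check_road.py`, the helpers `is_regressed` and `collect_fragments`, and the choice of "Road" on the first page as the marker are all assumptions made for illustration. It relies on pypdf's `visitor_text` callback and assumes pypdf is installed in editable mode (`pip install -e .`) from the checkout being bisected.

```python
"""Hypothetical check script for `git bisect run` (a sketch, not the reporter's code)."""
import sys


def is_regressed(fragments, word="Road"):
    """Return True if `word` is missing or only reported at the (0, 0) origin.

    `fragments` is an iterable of (text, x, y) tuples collected by a text visitor.
    """
    positions = [(x, y) for text, x, y in fragments if word in text]
    return not positions or all(x == 0 and y == 0 for x, y in positions)


def collect_fragments(pdf_path):
    """Collect (text, x, y) for every text fragment on the first page."""
    # Imported lazily so that is_regressed() can be used without pypdf installed.
    from pypdf import PdfReader

    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] are the translation components of the text matrix.
        fragments.append((text, tm[4], tm[5]))

    # Assumption: the affected text is on the first page of the document.
    PdfReader(pdf_path).pages[0].extract_text(visitor_text=visitor)
    return fragments


if __name__ == "__main__" and len(sys.argv) > 1:
    # `git bisect run` treats exit code 0 as "good" and 1 as "bad".
    sys.exit(1 if is_regressed(collect_fragments(sys.argv[1])) else 0)
```

With a known-bad and a known-good revision, usage might look like `git bisect start <bad> <good>` followed by `git bisect run python check_road.py /path/to/EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf`; git then reports the first commit for which the script exits with 1.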
Extracting text used to extract all words, now at least one is missing from the bounding box
Environment
Both Linux and Windows.
v5.0.1 has been tested and is fine.
Code + PDF
With this PDF:
EMSR718_AOI02_DEL_PRODUCT_18000_map_v1.pdf
Running this snippet:
With pypdf v5.1.0, the output is:
With pypdf v5.0.1, the output is:
The word "Road" is missing. After some checks, I see that in the new version the x, y position for "Road" is set to 0, 0, which is really weird.