Unknown character format in extracted content #725

drrobotic · 2023-10-20T10:36:39Z

drrobotic
Oct 20, 2023

Hi,
I wanted to extract the raw content of some PDFs I have. For older ones it works just fine,
but in newer ones the text is not plain readable ASCII. It is in some other format
that I could not identify.

Content from the older ones looks like this:

BT /F0 10.5 Tf 1 0 0 1 375.591 558.63 Tm (31.)Tj 1 0 0 1 393.08499 558.63 Tm (Mai)Tj 1 0 0 1 412.91 558.63 Tm (2023)Tj

Content from the newer ones looks like this (the PDFs are similar and the text is at the same position):

BT /F0 10.5 Tf 1 0 0 1 375.591 558.63 Tm (\000\026\000\024\000\021)Tj 1 0 0 1 393.08499 558.63 Tm (\000$\000X\000J\000X\000V\000W)Tj 1 0 0 1 428.66 558.63 Tm (\000\025\000\023\000\025\000\026)Tj

Does anybody know how I can decode this?

hhrutter · 2023-10-20T15:49:42Z

hhrutter
Oct 20, 2023
Maintainer

Hello!
Are you using the extract content command?
Can you share a small sample for investigation?


drrobotic · 2023-10-20T16:23:31Z

drrobotic
Oct 20, 2023
Author

I use the pdfcpu.ExtractPageContent(...) method to get the content in my program. If you look at the snippets above, the texts should be "31.", "August" and "2023", but I get "\000\026\000\024\000\021", "\000$\000X\000J\000X\000V\000W" and "\000\025\000\023\000\025\000\026".


hhrutter · 2023-10-21T00:11:07Z

hhrutter
Oct 21, 2023
Maintainer

The command returns the raw page content in PDF syntax.
In your case you have octal values representing glyph codes.
The decoding involves knowing what kind of font is used and which encoding.
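
To make the point above concrete, here is a minimal sketch in Go of what such a decoding could look like for the strings in this thread. It is not part of pdfcpu; the names unescapeLiteral, decodeTwoByte and sampleOffset are made up for illustration. It assumes the font uses 2-byte codes, and it assumes that in this particular sample each code happens to be the ASCII value minus 29 (which matches the expected "31.", "August", "2023"). That offset is only an observation about this sample. In general the mapping has to come from the font, for example its ToUnicode CMap.

```go
package main

import (
	"fmt"
	"strings"
)

// unescapeLiteral decodes the body of a PDF literal string (the part between
// the parentheses). It handles octal escapes (\ddd) and the common
// single-character escapes; line continuations are not handled in this sketch.
func unescapeLiteral(s string) []byte {
	simple := map[byte]byte{'n': '\n', 'r': '\r', 't': '\t', 'b': '\b', 'f': '\f'}
	var out []byte
	for i := 0; i < len(s); i++ {
		if s[i] != '\\' || i+1 == len(s) {
			out = append(out, s[i])
			continue
		}
		i++ // move to the character after the backslash
		if s[i] >= '0' && s[i] <= '7' {
			// One to three octal digits form a single byte.
			v, n := 0, 0
			for ; n < 3 && i+n < len(s) && s[i+n] >= '0' && s[i+n] <= '7'; n++ {
				v = v*8 + int(s[i+n]-'0')
			}
			i += n - 1 // leave i on the last digit; the loop's i++ skips it
			out = append(out, byte(v))
			continue
		}
		if c, ok := simple[s[i]]; ok {
			out = append(out, c)
		} else {
			out = append(out, s[i]) // \\, \( and \) map to themselves
		}
	}
	return out
}

// decodeTwoByte reads b as big-endian 2-byte codes and maps each code to a
// rune using toRune. Codes without a mapping become U+FFFD.
func decodeTwoByte(b []byte, toRune func(code uint16) (rune, bool)) (string, error) {
	if len(b)%2 != 0 {
		return "", fmt.Errorf("odd byte count %d for a 2-byte encoding", len(b))
	}
	var sb strings.Builder
	for i := 0; i < len(b); i += 2 {
		code := uint16(b[i])<<8 | uint16(b[i+1])
		r, ok := toRune(code)
		if !ok {
			r = '\uFFFD'
		}
		sb.WriteRune(r)
	}
	return sb.String(), nil
}

// sampleOffset is an ASSUMPTION that fits only the sample in this thread:
// code + 29 == ASCII value. A real decoder would look the code up in the
// font's ToUnicode CMap instead.
func sampleOffset(code uint16) (rune, bool) {
	return rune(code) + 29, true
}

func main() {
	literals := []string{
		`\000\026\000\024\000\021`,
		`\000$\000X\000J\000X\000V\000W`,
		`\000\025\000\023\000\025\000\026`,
	}
	for _, lit := range literals {
		text, err := decodeTwoByte(unescapeLiteral(lit), sampleOffset)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q\n", text)
	}
}
```

Run against the three strings from the newer PDF, this prints "31.", "August" and "2023". Passing the mapping in as a function keeps the sample-specific offset separate from the generic steps, so it can be swapped for a table built from the font's actual encoding.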

1 reply

drrobotic Oct 22, 2023
Author

OK, thanks. Can I find an example of decoding in your project?

hhrutter · 2023-10-23T07:35:19Z

hhrutter
Oct 23, 2023
Maintainer

Sorry, not at the moment.
What you are looking for is text extraction; see #122.

