Unable to retrieve text from some PDF documents #796

securigy · 2024-03-15T01:17:15Z

I have been using the library for a while now. However, today I noticed that if I save a web page as a PDF using the Microsoft Print to PDF driver (that is, printing to PDF), the code is unable to retrieve the text.
Here is one such page that I printed to PDF:
https://healingthebody.ca/4-natural-proven-cancer-remedies/

And here is the code:

    using (PdfDocument document = PdfDocument.Open(fileStream))
    {
        PdfDocInfo pdfDocInfo = new PdfDocInfo()
        {
            DocFilePath = fileName,
            TotalPages = document.NumberOfPages,
            Version = document.Version,
            Title = document.Information.Title,
            Subject = document.Information.Subject,
            Author = document.Information.Author,
            DateCreated = dateCreated,
            DateModified = dateModified,
        };

        string docText = "";
        string pattern = @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

        foreach (Page page in document.GetPages())
        {
            docText += ContentOrderTextExtractor.GetText(page, true);
        }

        // At this point docText is empty because GetText returns an empty string for every page.
    }

Any remedy for this?


BobLd · 2024-04-14T14:17:35Z

@securigy Can you provide the exact PDF you used (generated from the HTML page, I assume)?
