Memory Issues on GetWords() and crashes with given file #820

Open
stephen-williamson opened this issue Apr 16, 2024 · 2 comments

Comments

stephen-williamson commented Apr 16, 2024

0020.pdf

I am having an issue with a given PDF. The PDF itself is larger than most that I use PdfPig for, at around 13 MB (normally my PDFs are <1 MB).
It takes longer than normal to call the GetPage() method (about 5 seconds instead of being near-instant), but it does succeed. The GetWords() method, however, hangs for a long time (multiple minutes) before eventually crashing.

In that time, memory shoots right up: I end up with a 1.5 GB GC heap size and around a 5 GiB allocation rate in the Visual Studio diagnostics session.

I cannot even catch the error with a try/catch.
Any help would be great, even if it was just being able to handle the crash gracefully. I've attached a snapshot of the memory usage.

var path = @"C:\Users\stephen.williamson\Downloads\0020.pdf";

using (var document = PdfDocument.Open(path))
{
    for (var i = 0; i < document.NumberOfPages; i++)
    {
        var page = document.GetPage(i + 1); //This line takes about 5 seconds
        
        var words = page.GetWords(NearestNeighbourWordExtractor.Instance); // This crashes here; if I remove the parameter, it crashes on the next line instead
        var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
        var orderedBlocks = DefaultReadingOrderDetector.Instance.Get(blocks);

        Console.WriteLine("((TEXT SECTION))");

        foreach (var block in orderedBlocks)
        {
            Console.WriteLine("==BLOCK==");
            Console.WriteLine(block.Text);
           
            // Do something
        }
    }
}

[Screenshot: Visual Studio memory diagnostics snapshot]

BobLd (Collaborator) commented Apr 21, 2024

@stephen-williamson Thanks for sharing the document. The main issue I see with your document is that the page contains about 2 million letters. NearestNeighbourWordExtractor was not designed to handle that many letters.

Fixing that involves a deep optimisation of the layout analysis algorithms. The document you provided will be very useful for benchmarking, though.
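
Until the layout analysis algorithms are optimised, one possible user-side workaround (a sketch only, not a PdfPig feature) is to check a page's letter count via Page.Letters before running word extraction, and skip or special-case pages with an extreme number of letters. The snippet below reuses the path variable from the snippet above; the 100,000 threshold is an arbitrary illustration, not a recommended value.

using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Arbitrary example threshold; tune it for your own workload.
const int maxLettersForLayoutAnalysis = 100_000;

using (var document = PdfDocument.Open(path))
{
    for (var i = 1; i <= document.NumberOfPages; i++)
    {
        var page = document.GetPage(i);

        // Page.Letters exposes the raw letters, so the count is cheap to check
        // before committing to the expensive layout analysis step.
        if (page.Letters.Count > maxLettersForLayoutAnalysis)
        {
            Console.WriteLine($"Page {i}: {page.Letters.Count} letters, skipping word extraction.");
            continue;
        }

        var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
        // ... layout analysis as in the original snippet
    }
}
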

BobLd (Collaborator) commented Apr 21, 2024

After further analysis, the letter count can be brought down to 300k by only taking into account the letters that are within the boundary of the page.

Related to #681
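
For reference, a rough user-side version of that filtering can be done with the public API by running the word extractor on a pre-filtered letter list instead of calling page.GetWords(...). The sketch below assumes page is the Page obtained from document.GetPage(...) as above, and approximates the page boundary as the rectangle (0, 0) to (page.Width, page.Height), which may not match exactly how PdfPig handles the crop/media boxes internally. It only reduces the input size; it does not address the underlying performance of the layout algorithms.

using System.Linq;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Keep only letters whose glyph rectangle lies fully within the page rectangle
// (0, 0) - (Width, Height); letters positioned far outside the page are discarded.
var lettersOnPage = page.Letters
    .Where(l => l.GlyphRectangle.Left >= 0
             && l.GlyphRectangle.Right <= page.Width
             && l.GlyphRectangle.Bottom >= 0
             && l.GlyphRectangle.Top <= page.Height)
    .ToList();

// IWordExtractor.GetWords accepts a letter list directly, so the reduced set
// can be fed to the same extractor used in the original snippet.
var words = NearestNeighbourWordExtractor.Instance.GetWords(lettersOnPage);
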
