Following the discussion on #510, I am also testing Parsr on small files with few or no modules. With the README.pdf provided in the samples (8 pages):
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Using extractor: PDFJsExtractor
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Running extractor PDF.js
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): executing command: qpdf --decrypt --no-warn /tmp/f2f1cf2c1053576eca2a6acd83e045/a02a5859e0d4634f2e54dd4cb23680.pdf /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Qpdf repair succeed --> /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): executing command: mutool clean -g /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Mutool clean succeed --> /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Exporting json...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Writing file: /tmp/ba66b32a7782915beef6706b8fdc9a.json
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running cleaner...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: OutOfPageRemovalModule, Options: {}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: HeaderFooterDetectionModule, Options: {"ignorePages":[],"maxMarginPercentage":15,"similaritySizePercentage":10}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Detecting marginals (headers and footers) with maxMarginPercentage: 15 ...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Document margins for maxMarginPercentage 15: top: 125, bottom: 715, left: undefined, right: 559
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: ReadingOrderDetectionModule, Options: {"minVerticalGapWidth":5,"minColumnWidthInPagePercent":15}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s
The first stage (extraction) alone already takes about 1.5 s. The total time from invoking the API to retrieving the done status is over 4 s. For comparison, PyMuPDF takes about 40 ms. For a 40-page document the numbers are 10 s vs 200 ms. Any idea how to speed this up? The config is below:
It seems you have some overhead somewhere else, since the elapsed times in your log (1.428s for extraction plus 0.184s for the modules, about 1.6s in total) are well below 4s.
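To make the gap concrete, here is a minimal sketch that sums the `Elapsed time` lines from the log quoted in the issue and compares them with the reported 4s. The log excerpt is copied from above with the timestamp and host prefix trimmed. The variable names are only illustrative, and 4.0 is used as a lower bound because the issue says "over 4s".

```python
import re

# Log lines from the issue, with the "[timestamp] INFO (...)" prefix trimmed.
LOG = """\
Running extractor PDF.js
Elapsed time: 1.428s
Running module: OutOfPageRemovalModule, Options: {}
Elapsed time: 0.005s
Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
Elapsed time: 0.02s
Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
Elapsed time: 0.073s
Running module: HeaderFooterDetectionModule, Options: {...}
Elapsed time: 0.013s
Running module: ReadingOrderDetectionModule, Options: {...}
Elapsed time: 0.07s
Total elapsed time: 0.184s
"""

# Per-stage timings: the first entry is the extractor, the rest are modules.
stages = [float(s) for s in re.findall(r"^Elapsed time: ([\d.]+)s$", LOG, re.M)]
extractor_s, module_times = stages[0], stages[1:]

# Total reported by the log after the last module.
total_match = re.search(r"^Total elapsed time: ([\d.]+)s$", LOG, re.M)
modules_total_s = float(total_match.group(1))

logged_s = extractor_s + modules_total_s   # work the log accounts for
reported_s = 4.0                           # lower bound from the issue ("over 4s")
unaccounted_s = reported_s - logged_s      # time spent outside the logged stages

print(f"extractor:   {extractor_s:.3f}s")
print(f"modules:     {modules_total_s:.3f}s (sum of stages: {sum(module_times):.3f}s)")
print(f"logged:      {logged_s:.3f}s")
print(f"unaccounted: at least {unaccounted_s:.3f}s")
```

Under these numbers, more than half of the observed wall-clock time is not covered by any logged stage, which is why the way the API is called matters.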
What is your pipeline and how do you call Parsr's API?
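One way to answer this is to time the submit step and the wait-for-done step separately on the client side. The sketch below is a generic harness and does not call Parsr: `time_job`, `fake_backend`, and the `poll_interval` parameter are hypothetical names introduced for illustration, and the `submit` and `is_done` callables would need to be replaced with whatever requests your pipeline actually sends.

```python
import time
from typing import Callable, Dict, Tuple


def time_job(
    submit: Callable[[], str],
    is_done: Callable[[str], bool],
    poll_interval: float = 0.05,
) -> Dict[str, float]:
    """Time the submit call and the polling loop separately."""
    t0 = time.perf_counter()
    job_id = submit()                      # e.g. the upload request
    t_submitted = time.perf_counter()

    polls = 0
    while True:
        polls += 1
        if is_done(job_id):                # e.g. the status request
            break
        time.sleep(poll_interval)          # latency added by the client itself
    t_done = time.perf_counter()

    return {
        "submit_s": t_submitted - t0,
        "wait_s": t_done - t_submitted,
        "polls": polls,
    }


def fake_backend(duration: float) -> Tuple[Callable[[], str], Callable[[str], bool]]:
    """Stand-in for a real service: the job becomes 'done' after `duration` seconds."""
    state = {"ready_at": 0.0}

    def submit() -> str:
        state["ready_at"] = time.perf_counter() + duration
        return "job-1"

    def is_done(job_id: str) -> bool:
        return time.perf_counter() >= state["ready_at"]

    return submit, is_done


if __name__ == "__main__":
    submit, is_done = fake_backend(duration=0.1)
    print(time_job(submit, is_done, poll_interval=0.02))
```

With a harness like this, a large `wait_s` combined with few polls would point at the polling interval rather than at the server, while a large `submit_s` would point at the upload or request setup.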