Parsr is slow also for small files and without (or with few) modules #639

Open
blenzi opened this issue Oct 24, 2022 · 2 comments

@blenzi

blenzi commented Oct 24, 2022

Following the discussion on #510, I am also testing Parsr on small files with few or no modules. With the README.pdf provided in the samples (8 pages):

[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Using extractor: PDFJsExtractor
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Running extractor PDF.js
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): executing command: qpdf --decrypt --no-warn /tmp/f2f1cf2c1053576eca2a6acd83e045/a02a5859e0d4634f2e54dd4cb23680.pdf /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Qpdf repair succeed --> /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): executing command: mutool clean -g /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Mutool clean succeed --> /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Exporting json...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Writing file: /tmp/ba66b32a7782915beef6706b8fdc9a.json
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running cleaner...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: OutOfPageRemovalModule, Options: {}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: HeaderFooterDetectionModule, Options: {"ignorePages":[],"maxMarginPercentage":15,"similaritySizePercentage":10}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Detecting marginals (headers and footers) with maxMarginPercentage: 15 ...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Document margins for maxMarginPercentage 15: top: 125, bottom: 715, left: undefined, right: 559
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: ReadingOrderDetectionModule, Options: {"minVerticalGapWidth":5,"minColumnWidthInPagePercent":15}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s

The first stage (extraction) already takes about 1.5 s, and the total time to invoke the API and retrieve the done status is over 4 s. For comparison, PyMuPDF takes about 40 ms. For a 40-page document, the numbers are 10 s versus 200 ms. Any idea how to speed it up? The config is below:

[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Config {
  version: 0.9,
  cleaner: [
    'out-of-page-removal',
    'whitespace-removal',
    'redundancy-detection',
    [
      'header-footer-detection',
      [Object]
    ],
    [
      'reading-order-detection',
      [Object]
    ]
  ],
  extractor: {
    pdf: 'pdfjs',
    ocr: 'tesseract',
    language: [
      'en'
    ]
  },
  output: {
    granularity: 'word',
    includeMarginals: true,
    includeDrawings: false,
    formats: {
      json: true,
      text: false,
      csv: false,
      markdown: false,
      pdf: false
    }
  }
}
@BinaryBrain
Collaborator

It seems you have some overhead somewhere else since the total elapsed time is way less than 4s.
What is your pipeline and how do you call Parsr's API?
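
To make the gap concrete, the timings quoted in the log can be added up and compared with the reported wall-clock time. The following is only a sketch: the helper name `summarize` is hypothetical, and it assumes that indented "Elapsed time" lines belong to individual cleaner modules, that the unindented one belongs to the extractor, and that "Total elapsed time" covers the cleaner stage only (the module timings sum to 0.181 s, close to the logged 0.184 s).

```python
import re

# "Elapsed time" lines copied from the log in the first comment.
LOG = """\
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s
"""

# Group 1: extra indentation after the log prefix (assumed to mark module timings).
# Group 2: which kind of timing line this is. Group 3: the duration in seconds.
ELAPSED = re.compile(r"\):\s(\s*)(Total elapsed|Elapsed) time: ([\d.]+)s")


def summarize(log: str) -> dict:
    """Split the logged durations into extractor, per-module, and cleaner total."""
    extractor = 0.0
    modules = 0.0
    cleaner_total = 0.0
    for line in log.splitlines():
        match = ELAPSED.search(line)
        if match is None:
            continue
        indent, kind, seconds = match.group(1), match.group(2), float(match.group(3))
        if kind == "Total elapsed":
            cleaner_total = seconds
        elif indent:
            modules += seconds
        else:
            extractor += seconds
    return {
        "extractor_s": round(extractor, 3),
        "modules_s": round(modules, 3),
        "cleaner_total_s": round(cleaner_total, 3),
        "logged_s": round(extractor + cleaner_total, 3),
    }


summary = summarize(LOG)
wall_clock_s = 4.0  # lower bound reported in the issue ("over 4s")
print(summary)
print("not covered by the log (s):", round(wall_clock_s - summary["logged_s"], 3))
```

Under these assumptions the log accounts for roughly 1.6 s, so more than half of the 4 s observed by the client is spent outside the stages that print an elapsed time.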

@blenzi
Author

blenzi commented Oct 25, 2022

To time the full operation, I am using the Python client in a Jupyter notebook:

%%timeit -n1 -r1
import time

parsr.send_document(
    file_path=pdf_file,
    config_path='/tmp/parsr_config.json',
    document_name='Test',
    save_request_id=True,
)

# Poll until the server response no longer reports a progress percentage.
while 'progress-percentage' in parsr.get_status()['server_response']:
    time.sleep(0.1)

The client is instantiated with

from parsr_client import ParsrClient
parsr = ParsrClient('localhost:3001')

and the API is running from its Docker image.
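
The `%%timeit` cell above measures upload, server-side processing, and polling as a single number. One way to see where the time goes is to time the submission and the wait separately. The sketch below is generic and makes no claim about Parsr's internals: `time_phases` is a hypothetical helper, and `submit` and `is_done` are placeholders that would wrap `parsr.send_document(...)` and the `'progress-percentage'` check from the cell above. A simulated job stands in for the server so the example runs on its own.

```python
import time
from typing import Callable


def time_phases(submit: Callable[[], object],
                is_done: Callable[[], bool],
                poll_interval: float = 0.1,
                timeout: float = 60.0) -> dict:
    """Time the submit call and the polling wait as two separate phases."""
    start = time.perf_counter()
    submit()
    submitted = time.perf_counter()

    polls = 0  # number of status checks that reported "not done yet"
    while not is_done():
        if time.perf_counter() - submitted > timeout:
            raise TimeoutError("job did not finish within the timeout")
        polls += 1
        time.sleep(poll_interval)
    finished = time.perf_counter()

    return {
        "submit_s": submitted - start,
        "wait_s": finished - submitted,
        "polls": polls,
    }


# Simulated job (an assumption for illustration): submission takes 50 ms
# and the job reports completion 300 ms after the script starts.
deadline = time.perf_counter() + 0.3
timings = time_phases(
    submit=lambda: time.sleep(0.05),
    is_done=lambda: time.perf_counter() >= deadline,
)
print(timings)

# With the client from this thread (requires a running Parsr API):
# timings = time_phases(
#     submit=lambda: parsr.send_document(
#         file_path=pdf_file,
#         config_path='/tmp/parsr_config.json',
#         document_name='Test',
#         save_request_id=True),
#     is_done=lambda: 'progress-percentage' not in parsr.get_status()['server_response'],
# )
```

Note that each status check is itself an HTTP round trip, and a fixed 0.1 s sleep means the measured wait can overshoot the real completion time by up to one polling interval plus one request.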
