@aman-17 aman-17 commented Jun 23, 2025

  1. Added a new file to run the benchmark on Nanonets OCR.
  2. Updated convert.py to support it.
  3. Updated Readme.md in the bench directory with the results.

@aman-17 aman-17 requested a review from jakep-allenai June 23, 2025 22:31

generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
cleaned_text = re.sub(r'<page_number>.*?</page_number>', '', output_text[0])
Collaborator


cleaned_text = re.sub(
    r'<page_number>\d+</page_number>',
    '',
    output_text[0]
)

Maybe use \d+ instead of .*? in the page number regexes, but otherwise looks good
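
For reference, a minimal sketch of how the two patterns differ. The sample string is made up for illustration and is not taken from the benchmark data; only the tag format comes from the regexes in this PR.

```python
import re

# Hypothetical model output; the second tag deliberately holds non-numeric content.
sample = "Intro text<page_number>12</page_number> body <page_number>n/a</page_number> end"

# Pattern currently in the diff: lazily matches anything between the tags.
LAZY = re.compile(r"<page_number>.*?</page_number>")
# Suggested pattern: only matches tags whose content is one or more digits.
DIGITS = re.compile(r"<page_number>\d+</page_number>")

lazy_cleaned = LAZY.sub("", sample)
digits_cleaned = DIGITS.sub("", sample)

print(lazy_cleaned)    # both tags removed
print(digits_cleaned)  # only the numeric tag removed
```

The stricter pattern leaves unexpected tag content in place, which may make odd model output easier to spot. Whether that is preferable depends on what the model actually emits inside the tags.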

@aman-17 aman-17 merged commit 1df93d0 into main Jun 23, 2025
6 of 8 checks passed