Added Nanonets OCR bench results #257

aman-17 · 2025-06-23T22:31:05Z

Added a new runner file to benchmark Nanonets OCR.
Updated convert.py to support it.
Updated Readme.md in the bench directory with the results.

jakep-allenai · 2025-06-23T23:02:57Z

olmocr/bench/runners/run_nanonetsocr.py

+
+        generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
+        output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+        cleaned_text = re.sub(r'<page_number>.*?</page_number>', '', output_text[0])


cleaned_text = re.sub(
    r'<page_number>\d+</page_number>',
    '',
    output_text[0]
)

Maybe do a \d+ in the page number regexes, but otherwise looks good
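
For readers comparing the two patterns, here is a minimal sketch of how they differ. The sample string is hypothetical and is not taken from real Nanonets OCR output; only the `<page_number>` tag format comes from the regexes in this PR.

```python
import re

# Hypothetical model output; only the <page_number> tag format comes from the PR.
sample = (
    "Intro text\n"
    "<page_number>12</page_number>\n"
    "More text <page_number>see appendix</page_number>"
)

# Original pattern: lazily matches anything (except newlines) between the tags.
LAZY = re.compile(r"<page_number>.*?</page_number>")

# Suggested pattern: matches only tags whose content is one or more digits.
DIGITS = re.compile(r"<page_number>\d+</page_number>")

lazy_cleaned = LAZY.sub("", sample)
digits_cleaned = DIGITS.sub("", sample)

print(repr(lazy_cleaned))
print(repr(digits_cleaned))
```

With the `\d+` form, a tag wrapping non-numeric text is left in place instead of being silently stripped, which may make unexpected model output easier to spot in the benchmark results.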

added nanonets

9d04b30

aman-17 requested a review from jakep-allenai June 23, 2025 22:31

jakep-allenai reviewed Jun 23, 2025

jakep-allenai approved these changes Jun 23, 2025

addressed Jake's comment for pagenumbers with \d+

202e229

aman-17 merged commit 1df93d0 into main Jun 23, 2025
6 of 8 checks passed
