Description
Hi, thanks for sharing this great framework!
I noticed that in the current evaluation loop (in `lmms_eval/evaluator.py`), the processing of model outputs is done sequentially per document, as shown in this snippet:
```python
for doc_id, doc in doc_iterator:
    requests = instances_by_doc_id[doc_id]
    metrics = task.process_results(doc, [req.filtered_resps[filter_key] for req in requests])
    ...
```
While model inference itself may be batched, the post-processing and evaluation step (especially `task.process_results`), which in many cases involves calling an LLM to compute metrics (e.g., for open-ended generation or visual reasoning), is performed synchronously, one document at a time.
This becomes a significant bottleneck when evaluation requires additional LLM calls (e.g., using GPT-4 as a judge), resulting in an extremely slow overall evaluation.
Does the current codebase support parallel or multi-threaded evaluation during the post-processing stage?
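For reference, here is a rough sketch of the kind of concurrency I have in mind, using `concurrent.futures.ThreadPoolExecutor` since the judge calls are mostly I/O-bound. The names `doc_iterator`, `instances_by_doc_id`, `task`, and `filter_key` are taken from the snippet above; the `_score_doc` helper and the worker count are only illustrative and not part of the current codebase:

```python
from concurrent.futures import ThreadPoolExecutor


def _score_doc(item):
    # Hypothetical helper: score a single document with task.process_results,
    # mirroring the body of the sequential loop above.
    doc_id, doc = item
    requests = instances_by_doc_id[doc_id]
    metrics = task.process_results(
        doc, [req.filtered_resps[filter_key] for req in requests]
    )
    return doc_id, metrics


# Run the per-document judge calls concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=8) as pool:
    for doc_id, metrics in pool.map(_score_doc, doc_iterator):
        ...  # accumulate metrics exactly as the sequential loop does
```

This is just an example of the direction; I am happy to adapt to whatever approach fits the existing design, if such support is planned or already possible.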
Thank you very much for your time and for any guidance you can offer.