Does the current code support parallel evaluation of model outputs? #851

@pspdada

Description

Hi, thanks for sharing this great framework!

I noticed that in the current evaluation loop (in lmms_eval/evaluator.py), the processing of model outputs is done sequentially per document, as shown in this snippet:

for doc_id, doc in doc_iterator:
    requests = instances_by_doc_id[doc_id]
    metrics = task.process_results(doc, [req.filtered_resps[filter_key] for req in requests])
    ...

While model inference itself may be batched, the post-processing and evaluation (especially task.process_results, which in many cases involves calling an LLM to compute metrics, e.g., for open-ended generation or visual reasoning) is performed synchronously, one document at a time.

This becomes a significant bottleneck when evaluation requires additional LLM calls (e.g., using GPT-4 as a judge), resulting in an extremely slow overall evaluation.

Does the current codebase support parallel or multi-threaded evaluation during the post-processing stage?
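
For concreteness, here is a minimal sketch of the kind of parallelism I have in mind, assuming task.process_results is thread-safe and mostly I/O-bound (waiting on the judge LLM). The variable names follow the snippet above, and max_workers is just an illustrative value:

    from concurrent.futures import ThreadPoolExecutor

    def _process_one(item):
        # item is a (doc_id, doc) pair from doc_iterator
        doc_id, doc = item
        requests = instances_by_doc_id[doc_id]
        metrics = task.process_results(
            doc, [req.filtered_resps[filter_key] for req in requests]
        )
        return doc_id, metrics

    # ThreadPoolExecutor.map preserves the input order, so downstream
    # aggregation could stay the same as in the sequential loop.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for doc_id, metrics in pool.map(_process_one, doc_iterator):
            ...

Threads (rather than processes) should be enough here if the judge calls are network-bound, but I may be missing a reason the current loop has to stay sequential.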

Thank you so much for your attention and participation.
