Does the current code support parallel evaluation of model outputs? #851

@pspdada

Description

Hi, thanks for sharing this great framework!

I noticed that in the current evaluation loop (in lmms_eval/evaluator.py), the processing of model outputs is done sequentially per document, as shown in this snippet:

for doc_id, doc in doc_iterator:
    requests = instances_by_doc_id[doc_id]
    metrics = task.process_results(doc, [req.filtered_resps[filter_key] for req in requests])
    ...

While model inference itself may be batched, the post-processing and evaluation (especially task.process_results, which in many cases involves calling an LLM to compute metrics, e.g., for open-ended generation or visual reasoning) is performed synchronously, one document at a time.

This becomes a significant bottleneck when evaluation requires additional LLM calls (e.g., using GPT-4 as a judge), resulting in an extremely slow overall evaluation.

Does the current codebase support parallel or multi-threaded evaluation during the post-processing stage?
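
For concreteness, here is a minimal sketch of the kind of parallelism I have in mind, assuming task.process_results is thread-safe and mostly I/O-bound (waiting on the judge LLM). The variable names follow the snippet above, and max_workers is just an illustrative value:

    from concurrent.futures import ThreadPoolExecutor

    def _process_one(item):
        # item is a (doc_id, doc) pair from doc_iterator
        doc_id, doc = item
        requests = instances_by_doc_id[doc_id]
        metrics = task.process_results(
            doc, [req.filtered_resps[filter_key] for req in requests]
        )
        return doc_id, metrics

    # ThreadPoolExecutor.map preserves the input order, so downstream
    # aggregation could stay the same as in the sequential loop.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for doc_id, metrics in pool.map(_process_one, doc_iterator):
            ...

Threads (rather than processes) should be enough here if the judge calls are network-bound, but I may be missing a reason the current loop has to stay sequential.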

Thank you so much for your attention and participation.
