v0.5: Better Coverage of Audio Evaluations and Alignment Checks on STEM/Reasoning Benchmarks
Introduction
LMMS-Eval v0.5 focuses on audio evaluation, a production-ready response caching system, five new model integrations, and a broad set of new benchmarks spanning audio, vision, coding, and STEM.
Key Highlights:
- Audio-First: Comprehensive audio evaluation with paralinguistic analysis
- Response Caching: Production-ready caching system for faster re-evaluation
- 5 New Models: Including audio-capable GPT-4o, LongViLA, Gemma-3
- 50+ New Benchmark Variants: Audio, vision, coding, and STEM tasks
- MCP Integration: Model Context Protocol client support
Table of Contents
- Introduction
- Major Features
- Usage Examples
- Technical Details
- Migration Guide
- Bug Fixes and Improvements
- Deprecated Features
- Contributing
- Acknowledgments
- Getting Help
Major Features
1. Response Caching System
A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:
Key Features:
- Per-document caching: Cached at the (task_name, doc_id) level
- Distributed-safe: Separate cache files per rank/world size
- Zero overhead: Automatic cache hits with no code changes
- Multi-backend: Works with async OpenAI, vLLM, and custom models
Enable Caching:
```bash
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root"  # optional

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
  --tasks mmmu_val \
  --batch_size 1 \
  --output_path ./logs/
```
Cache Location:
- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <string>}`
API Integration:
```python
def generate_until(self, requests):
    # Load any existing cache entries for this task/rank
    self.load_cache()

    # Split requests into cache hits and requests that still need inference
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]

    for req in pending:
        out = call_backend(req)  # placeholder for your model's actual inference call
        self.add_request_response_to_cache(req, out)
        results.append(out)

    return results
```
See the full documentation in docs/caching.md.
2. Audio Evaluation Suite
Comprehensive audio understanding capabilities with three major benchmark families:
Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- Acoustic Features: pitch, rhythm, speed, voice_tone, voice_styles
- Speaker Attributes: age, gender, emotions
- Environmental: scene, event, vocalsound
- Semantic match metrics
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic \
  --batch_size 1
```
VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- Instruction Following: ifeval, alpacaeval, advbench
- Reasoning: bbh (Big Bench Hard), commoneval
- Knowledge: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- Q&A: openbookqa
- Accent Diversity: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- Expressiveness: wildvoice
- Metrics vary by task type, including accuracy, 1-5 LLM-judge scores, and failure rate
```bash
# Full VoiceBench
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench \
  --batch_size 1

# Specific accent evaluation
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
  --batch_size 1
```
WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- dev: Development set for validation
- test_meeting: Meeting domain evaluation
- MER (Mixed Error Rate) metrics (see the sketch below)
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks wenet_speech_dev,wenet_speech_test_meeting \
  --batch_size 1
```
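As a point of reference, MER treats each Chinese character and each English word as one token before computing an edit distance; a rough sketch of the idea (the task's actual scoring code may normalize text differently):

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    """Treat each Chinese character and each English word as one token."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution / match
        prev = curr
    return prev[-1]

def mixed_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(mixed_error_rate("今天 weather 很好", "今天的 weather 很好"))  # 0.2
```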
Audio Pipeline Features:
- HuggingFace audio dataset integration
- Unified audio message format
- Multiple metric support (Accuracy, WER, GPT-4 Judge)
- Task grouping for multi-subset benchmarks
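To make the unified message format concrete, here is a minimal sketch of turning one sample from a HuggingFace audio dataset into a message; the dataset name and field names are placeholders, not an actual task:

```python
from datasets import load_dataset

# Placeholder dataset and field names, for illustration only.
ds = load_dataset("your-org/your-audio-benchmark", split="test")
doc = ds[0]

message = {
    "role": "user",
    "content": [
        {"type": "audio", "url": doc["audio_path"]},   # or an in-memory audio array
        {"type": "text", "text": doc["question"]},
    ],
}
```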
3. New Model Support
Five new model integrations expanding audio and vision capabilities:
Model | Type | Key Features | Usage Example |
---|---|---|---|
GPT-4o Audio Preview | Audio+Text | Paralinguistic understanding, multi-turn audio | --model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17 |
Gemma-3 | Vision+Text | Enhanced video handling, efficient architecture | --model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it |
LLaVA-OneVision 1.5 | Vision+Text | Improved vision understanding, latest LLaVA | --model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b |
LongViLA-R1 | Video+Text | Long-context video, efficient video processing | --model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B |
Thyme | Vision+Text | Reasoning-focused, enhanced image handling | --model thyme --model_args pretrained=thyme-ai/thyme-7b |
Example Usage:
```bash
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 1

# LongViLA for video understanding
python -m lmms_eval \
  --model longvila \
  --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
  --tasks videomme,egoschema \
  --batch_size 1
```
4. New Benchmarks
Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage of specialized domains:
Vision & Reasoning Benchmarks
Benchmark | Variants | Focus | Metrics |
---|---|---|---|
CSBench | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
SciBench | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
MedQA | 1 | Medical question answering | Accuracy |
SuperGPQA | 1 | Graduate-level science Q&A | Accuracy |
Lemonade | 1 | Video action recognition | Accuracy |
CharXiv | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |
Example Usage:
```bash
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1

# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1

# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
```
Reproducibility Validation
We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:
| Model | Task | lmms-eval | Reported | Δ | Status |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| Llama-3.1-8B | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| | SciBench | 15.35 | 10.78 | +4.57 | ± |
| | CSBench | 62.49 | 57.87 | +4.62 | ± |
| | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |
Status Legend: ✓ = Strong agreement (|Δ| ≤ 2.5 points) | ± = Acceptable variance (2.5 < |Δ| ≤ 5 points)
5. Model Context Protocol (MCP) Integration
Support for MCP-enabled models with tool calling:
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
  --tasks mmmu_val \
  --batch_size 1
```
Features:
- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration (see the sketch below)
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates
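For the custom server option, the sketch below shows what a stand-alone MCP server might look like using the official `mcp` Python SDK. This is an assumption about one way to structure a server, not code shipped with lmms-eval, and the tool itself is purely illustrative:

```python
# mcp_server.py - a minimal stand-alone MCP server exposing one illustrative tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def crop_region(image_path: str, x0: int, y0: int, x1: int, y1: int) -> str:
    """Crop a region of an image and return the path of the cropped copy."""
    from PIL import Image

    out_path = f"{image_path}.crop.png"
    Image.open(image_path).crop((x0, y0, x1, y1)).save(out_path)
    return out_path

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

The server's path would then be passed through `mcp_server_path`, as in the command above.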
6. Async OpenAI Improvements
Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with delays
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints
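The retry behavior can be pictured as exponential backoff around each API call; a minimal sketch of the pattern, not the exact implementation inside the `async_openai` model:

```python
import asyncio
import random

async def call_with_retries(make_call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async API call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception as err:  # in practice, catch rate-limit/timeout errors specifically
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# Usage: await call_with_retries(lambda: client.chat.completions.create(...))
```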
Common Args Support:
```bash
# Now supports additional parameters
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
  --tasks mmstar
```
Usage Examples
Audio Evaluation with Caching
```bash
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 8 \
  --output_path ./audio_results/ \
  --log_samples

# The second run will use the cache - much faster!
```
Multi-Benchmark Evaluation
```bash
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks voicebench_mmsu,csbench,scibench_math,charxiv \
  --batch_size 4 \
  --output_path ./multimodal_results/
```
Distributed Evaluation with Caching
```bash
export LMMS_EVAL_USE_CACHE=True

torchrun --nproc_per_node=8 -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
  --tasks step2_audio_paralinguistic,csbench,scibench \
  --batch_size 16 \
  --output_path ./distributed_results/
```
Programmatic API with Caching
```python
import os

from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

# Enable caching
os.environ["LMMS_EVAL_USE_CACHE"] = "True"

model = AsyncOpenAICompatibleChat(
    model_version="gpt-4o-audio-preview-2024-12-17",
    base_url="https://api.openai.com/v1",
)

results = simple_evaluate(
    model=model,
    tasks=["voicebench", "step2_audio_paralinguistic"],
    batch_size=8,
    device="cuda",
)

print(f"Results: {results['results']}")
```
Technical Details
Caching Architecture
Design Philosophy:
- Simplicity: JSONL format for easy inspection and debugging
- Distributed-safe: Per-rank files avoid write contention
- Transparent: No code changes needed for models using the API
Cache Key: `(task_name, doc_id)`
- Stable across runs if task and document IDs don't change
- The model hash is derived from `model_version` and the task list
File Structure:
```
~/.cache/lmms-eval/eval_cache/
└── <model_hash>/
    ├── task1_rank0_world_size1.jsonl
    ├── task1_rank1_world_size1.jsonl
    └── task2_rank0_world_size1.jsonl
```
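The path layout can be illustrated with a hypothetical helper; the actual hashing scheme used by lmms-eval may differ, this only shows how the pieces fit together:

```python
import hashlib
from pathlib import Path

def cache_path(model_version: str, tasks: list[str], task_name: str,
               rank: int, world_size: int) -> Path:
    # Hypothetical hash derivation: the real scheme in lmms-eval may differ.
    model_hash = hashlib.sha256(
        f"{model_version}:{','.join(sorted(tasks))}".encode()
    ).hexdigest()[:16]
    root = Path.home() / ".cache" / "lmms-eval" / "eval_cache" / model_hash
    return root / f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

print(cache_path("gpt-4o-2024-11-20", ["mmmu_val"], "mmmu_val", rank=0, world_size=1))
```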
Performance:
- Initial run: Full model inference
- Cached run: ~100x faster (I/O bound only)
- Distributed: Linear scaling with cache hits
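One rough way to observe the effect is to time two consecutive runs with caching enabled, using the programmatic API shown in the Usage Examples section; this assumes the second call reloads the cache (as in the `generate_until` sketch earlier), and exact numbers depend on the tasks and backend:

```python
import os
import time

from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

os.environ["LMMS_EVAL_USE_CACHE"] = "True"
model = AsyncOpenAICompatibleChat(model_version="gpt-4o-2024-11-20")

for label in ("initial", "cached"):
    start = time.perf_counter()
    simple_evaluate(model=model, tasks=["mmmu_val"], batch_size=1)
    print(f"{label} run: {time.perf_counter() - start:.1f}s")
```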
Audio Processing Pipeline
Data Flow:
- Load HuggingFace audio datasets
- Convert to unified message format with audio URLs
- Process through audio-capable models
- Apply task-specific metrics (WER, accuracy, GPT-4 judge)
- Aggregate across task groups
Message Format:
```json
{
  "role": "user",
  "content": [
    {"type": "audio", "url": "path/to/audio.wav"},
    {"type": "text", "text": "Question about the audio"}
  ]
}
```
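For API-backed audio models, a message like the one above ultimately has to be mapped to the provider's request format. A rough sketch for an OpenAI-style audio-capable endpoint is shown below; the exact payload fields may differ per provider:

```python
import base64

def to_openai_content(parts: list[dict]) -> list[dict]:
    """Map unified message parts to OpenAI-style chat content parts."""
    converted = []
    for part in parts:
        if part["type"] == "audio":
            with open(part["url"], "rb") as f:
                audio_b64 = base64.b64encode(f.read()).decode()
            converted.append(
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}}
            )
        else:
            converted.append({"type": "text", "text": part["text"]})
    return converted
```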
Model Context Protocol
MCP enables models to call external tools during evaluation:
- Custom server implementation
- Tool definition and parsing
- Multi-step reasoning with tool results
- Compatible with OpenAI-style function calling
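For reference, OpenAI-style function calling describes each tool with a JSON schema like the one below; the tool name and parameters here are illustrative:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "crop_region",  # illustrative tool, mirroring the MCP server sketch above
        "description": "Crop a region of the current image and return the new image path.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_path": {"type": "string"},
                "x0": {"type": "integer"}, "y0": {"type": "integer"},
                "x1": {"type": "integer"}, "y1": {"type": "integer"},
            },
            "required": ["image_path", "x0", "y0", "x1", "y1"],
        },
    },
}]

# Passed as tools=tools to a chat.completions.create(...) call; any tool_calls the
# model returns are then parsed and executed by the MCP client.
```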
Migration Guide
From v0.4 to v0.5
No Breaking Changes: v0.5 is fully backward compatible with v0.4.
New Features to Adopt:
- Enable Caching for API Models:
  ```bash
  # Add these environment variables
  export LMMS_EVAL_USE_CACHE=True
  ```
- Use New Audio Models:
  ```bash
  # GPT-4o Audio Preview
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17
  ```
- Leverage New Benchmarks:
  ```bash
  # Add audio, code, and STEM benchmarks
  --tasks step2_audio_paralinguistic,voicebench,csbench,scibench
  ```
- Optimize Async OpenAI Calls:
  ```bash
  # Use additional parameters for better control
  model_args="model_version=gpt-4o,temperature=0.7,max_tokens=2048"
  ```
Updating Existing Workflows
Before (v0.4):
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-08-06 \
  --tasks mmmu_val \
  --batch_size 1
```
After (v0.5 with caching):
```bash
export LMMS_EVAL_USE_CACHE=True

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks mmmu_val,voicebench,csbench \
  --batch_size 8  # higher batch size works well with caching
```
Bug Fixes and Improvements
Fixed Issues
- `--write_out` Flag Deprecated: The `--write_out` flag is now deprecated in favor of `--log_samples`
  ```bash
  # Old (deprecated)
  --write_out
  # New
  --log_samples
  ```
- TypeError in `write_out` with `log_samples`: Fixed crash when using both flags together
- Batch Size in OpenAI Endpoint: Corrected batch size handling for OpenAI-compatible servers
- Gemma-3 Loading: Fixed model loading to use `Gemma3ForConditionalGeneration` correctly
- SRT API Bugfix: Resolved issues in subtitle/caption processing
- CharXiv Improvements: Fixed chart understanding task configurations
- Async OpenAI Caching Order: Corrected cache lookup order to avoid unnecessary API calls
Performance Improvements
- 10-100x speedup on cached evaluations
- Better async handling for API-based models
- Reduced memory usage in distributed settings
- Faster audio dataset loading from HuggingFace
Deprecated Features
Deprecated Flags
- `--write_out`: Use `--log_samples` instead
  ```bash
  # Deprecated
  python -m lmms_eval --write_out
  # Use instead
  python -m lmms_eval --log_samples
  ```
Model Notes
- Models should implement the caching API for best performance
- Legacy simple models continue to work but miss out on caching benefits
- See `lmms_eval.api.model.lmms` for caching integration
Contributing
We welcome contributions to LMMS-Eval! The v0.5 release demonstrates the value of community contributions across models, benchmarks, and infrastructure.
High-Priority Areas for v0.5.x
- Audio Model Integrations: Help add support for more audio-capable models
- Audio Benchmark Implementations: Expand audio evaluation coverage
- Caching Optimizations: Improve cache hit rates and performance
- Documentation: Enhance guides and examples for audio evaluation
- MCP Server Examples: Create reference implementations for tool calling
How to Contribute
- Fork the repository and create a feature branch from `dev/v0d5`
- Follow the development guidelines in `CLAUDE.md`:
  - Use `uv` for package management (never pip)
  - Add type hints and docstrings
  - Run `uv run ruff format .` and `uv run ruff check . --fix`
  - Run `uv run pyright` for type checking
- Test thoroughly:
  - Add tests for new features
  - Verify caching works if implementing a model
  - Test with realistic datasets
- Submit a pull request with a clear description
Adding New Audio Benchmarks
Follow the pattern in existing audio tasks:
```python
# In tasks/your_audio_task/utils.py
def doc_to_messages(doc):
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": doc["audio_path"]},
            {"type": "text", "text": doc["question"]},
        ],
    }]
```
See `lmms_eval/tasks/step2_audio_paralinguistic/` and `lmms_eval/tasks/voicebench/` for examples.
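Audio tasks also typically define how a model's output is scored. Below is a hedged sketch of a simple exact-match result processor, with field names assumed for illustration; the real tasks wire such functions up through their YAML configs and often use richer metrics (WER, GPT judges):

```python
# In tasks/your_audio_task/utils.py (illustrative field names)
def process_results(doc, results):
    """Score a single prediction against the reference answer."""
    prediction = results[0].strip().lower()
    reference = doc["answer"].strip().lower()
    return {"accuracy": float(prediction == reference)}
```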
Adding Caching to Custom Models
Implement the caching API in your model's `generate_until`:
```python
class MyModel(lmms):
    def generate_until(self, requests):
        # Load previously cached responses for this task/rank
        self.load_cache()

        # Separate cached vs. pending requests
        cached, pending = self.get_response_from_cache(requests)
        responses = [c["response"] for c in cached]

        # Run inference only for the pending requests and cache the outputs
        for req in pending:
            response = self.my_inference_logic(req)
            self.add_request_response_to_cache(req, response)
            responses.append(response)

        return responses
```
See `lmms_eval/models/chat/async_openai.py` for a complete example.
Acknowledgments
The v0.5 release was made possible by contributions from the LMMS-Eval community:
Core Contributors
- Audio Evaluation Suite: Implementation of Step2 Audio Paralinguistic, VoiceBench, and WenetSpeech benchmarks
- Caching Infrastructure: Design and implementation of the JSONL caching system
- Model Integrations: Support for GPT-4o Audio Preview, Gemma-3, LLaVA-OneVision 1.5, LongViLA-R1, and Thyme
- Benchmark Additions: CSBench, SciBench, Lemonade, and CharXiv implementations
- MCP Integration: Model Context Protocol client and tool calling support
- Bug Fixes: Numerous fixes to async OpenAI, batch handling, and model loading
Special Thanks
- Community members who reported issues and provided feedback
- Contributors who improved documentation and examples
- Researchers who shared benchmark datasets and evaluation protocols
Getting Help
Documentation
- Main README: `README.md` - Quick start and overview
- Model Guide: `docs/model_guide.md` - Adding new models
- Task Guide: `docs/task_guide.md` - Implementing new benchmarks
- Caching Guide: `docs/caching.md` - Detailed caching documentation
- Commands: `docs/commands.md` - CLI reference
Support Channels
- GitHub Issues: Report bugs or request features at lmms-eval/issues
- GitHub Discussions: Ask questions and share ideas at lmms-eval/discussions
- Documentation: Check the `docs/` directory for implementation guides
Common Questions
Q: How do I enable caching?
```bash
export LMMS_EVAL_USE_CACHE=True
```
Q: Where are cache files stored?
In `~/.cache/lmms-eval/eval_cache/<model_hash>/`

Q: How do I evaluate audio models?
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench
```
Q: Can I use caching with distributed evaluation?
Yes! Caching works seamlessly with multi-GPU/multi-node evaluation. Each rank maintains its own cache file.
Q: What's the difference between `--write_out` and `--log_samples`?
`--write_out` is deprecated. Use `--log_samples` to save individual sample results.
Version: 0.5.0
Release Date: October 2025
Previous Version: v0.4 Release Notes