v0.5: Better Coverage of Audio Evaluations and Alignment Checks on STEM/Reasoning Benchmarks
Introduction
LMMS-Eval v0.5 focuses on audio evaluation, a production-ready response caching system, five new model integrations, and a broad set of new benchmarks spanning audio, vision, coding, and STEM.
Key Highlights:
- Audio-First: Comprehensive audio evaluation with paralinguistic analysis
- Response Caching: Production-ready caching system for faster re-evaluation
- 5 New Models: Including audio-capable GPT-4o, LongViLA, Gemma-3
- 50+ New Benchmark Variants: Audio, vision, coding, and STEM tasks
- MCP Integration: Model Context Protocol client support
Table of Contents
- Introduction
- Major Features
- Usage Examples
- Technical Details
- Migration Guide
- Bug Fixes and Improvements
- Deprecated Features
- Contributing
- Acknowledgments
- Getting Help
Major Features
1. Response Caching System
A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs:
Key Features:
- Per-document caching: Cached at the (task_name, doc_id) level
- Distributed-safe: Separate cache files per rank/world size
- Zero overhead: Automatic cache hits with no code changes
- Multi-backend: Works with async OpenAI, vLLM, and custom models
Enable Caching:
```bash
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root"  # optional

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
  --tasks mmmu_val \
  --batch_size 1 \
  --output_path ./logs/
```
Cache Location:
- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <string>}`
API Integration:
```python
def generate_until(self, requests):
    # Load any existing cache entries for this task/rank
    self.load_cache()

    # Split requests into cache hits and requests that still need inference
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]

    for req in pending:
        out = call_backend(req)  # placeholder for your model's actual inference call
        self.add_request_response_to_cache(req, out)
        results.append(out)

    return results
```
See the full documentation in docs/caching.md.
2. Audio Evaluation Suite
Comprehensive audio understanding capabilities with three major benchmark families:
Step2 Audio Paralinguistic (11 tasks)
Fine-grained paralinguistic feature evaluation:
- Acoustic Features: pitch, rhythm, speed, voice_tone, voice_styles
- Speaker Attributes: age, gender, emotions
- Environmental: scene, event, vocalsound
- Semantic match metrics
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic \
  --batch_size 1
```
VoiceBench (9 main categories, 30+ subtasks)
Comprehensive voice and speech evaluation:
- Instruction Following: ifeval, alpacaeval, advbench
- Reasoning: bbh (Big Bench Hard), commoneval
- Knowledge: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- Q&A: openbookqa
- Accent Diversity: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- Expressiveness: wildvoice
- Metrics vary by task type, including accuracy, 1-5 LLM-judge scores, and failure rate
```bash
# Full VoiceBench
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench \
  --batch_size 1

# Specific accent evaluation
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
  --batch_size 1
```
WenetSpeech (2 splits)
Large-scale ASR and speech evaluation:
- dev: Development set for validation
- test_meeting: Meeting domain evaluation
- MER (Mixed Error Rate) metrics (see the sketch below)
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks wenet_speech_dev,wenet_speech_test_meeting \
  --batch_size 1
```
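As a point of reference, MER treats each Chinese character and each English word as one token before computing an edit distance; a rough sketch of the idea (the task's actual scoring code may normalize text differently):

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    """Treat each Chinese character and each English word as one token."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution / match
        prev = curr
    return prev[-1]

def mixed_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(mixed_error_rate("今天 weather 很好", "今天的 weather 很好"))  # 0.2
```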
Audio Pipeline Features:
- HuggingFace audio dataset integration
- Unified audio message format
- Multiple metric support (Accuracy, WER, GPT-4 Judge)
- Task grouping for multi-subset benchmarks
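To make the unified message format concrete, here is a minimal sketch of turning one sample from a HuggingFace audio dataset into a message; the dataset name and field names are placeholders, not an actual task:

```python
from datasets import load_dataset

# Placeholder dataset and field names, for illustration only.
ds = load_dataset("your-org/your-audio-benchmark", split="test")
doc = ds[0]

message = {
    "role": "user",
    "content": [
        {"type": "audio", "url": doc["audio_path"]},   # or an in-memory audio array
        {"type": "text", "text": doc["question"]},
    ],
}
```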
3. New Model Support
Five new model integrations expanding audio and vision capabilities:
Model | Type | Key Features | Usage Example |
---|---|---|---|
GPT-4o Audio Preview | Audio+Text | Paralinguistic understanding, multi-turn audio | --model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17 |
Gemma-3 | Vision+Text | Enhanced video handling, efficient architecture | --model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it |
LLaVA-OneVision 1.5 | Vision+Text | Improved vision understanding, latest LLaVA | --model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b |
LongViLA-R1 | Video+Text | Long-context video, efficient video processing | --model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B |
Thyme | Vision+Text | Reasoning-focused, enhanced image handling | --model thyme --model_args pretrained=thyme-ai/thyme-7b |
Example Usage:
```bash
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 1

# LongViLA for video understanding
python -m lmms_eval \
  --model longvila \
  --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
  --tasks videomme,egoschema \
  --batch_size 1
```
4. New Benchmarks
Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage of specialized domains:
Vision & Reasoning Benchmarks
Benchmark | Variants | Focus | Metrics |
---|---|---|---|
CSBench | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
SciBench | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
MedQA | 1 | Medical question answering | Accuracy |
SuperGPQA | 1 | Graduate-level science Q&A | Accuracy |
Lemonade | 1 | Video action recognition | Accuracy |
CharXiv | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |
Example Usage:
```bash
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1

# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1

# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
```
Reproducibility Validation
We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:
| Model | Task | lmms-eval | Reported | Δ | Status |
|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| Llama-3.1-8B | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| | SciBench | 15.35 | 10.78 | +4.57 | ± |
| | CSBench | 62.49 | 57.87 | +4.62 | ± |
| | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |
Status Legend: ✓ = Strong agreement (|Δ| ≤ 2.5 points) | ± = Acceptable variance (2.5 < |Δ| ≤ 5 points)
5. Model Context Protocol (MCP) Integration
Support for MCP-enabled models with tool calling:
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
  --tasks mmmu_val \
  --batch_size 1
```
Features:
- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration (see the sketch below)
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates
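For the custom server option, the sketch below shows what a stand-alone MCP server might look like using the official `mcp` Python SDK. This is an assumption about one way to structure a server, not code shipped with lmms-eval, and the tool itself is purely illustrative:

```python
# mcp_server.py - a minimal stand-alone MCP server exposing one illustrative tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def crop_region(image_path: str, x0: int, y0: int, x1: int, y1: int) -> str:
    """Crop a region of an image and return the path of the cropped copy."""
    from PIL import Image

    out_path = f"{image_path}.crop.png"
    Image.open(image_path).crop((x0, y0, x1, y1)).save(out_path)
    return out_path

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```

The server's path would then be passed through `mcp_server_path`, as in the command above.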
6. Async OpenAI Improvements
Enhanced async API integration:
- Better rate limit handling
- Configurable retry logic with delays
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints
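The retry behavior can be pictured as exponential backoff around each API call; a minimal sketch of the pattern, not the exact implementation inside the `async_openai` model:

```python
import asyncio
import random

async def call_with_retries(make_call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async API call with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return await make_call()
        except Exception as err:  # in practice, catch rate-limit/timeout errors specifically
            if attempt == max_retries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({err}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# Usage: await call_with_retries(lambda: client.chat.completions.create(...))
```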
Common Args Support:
```bash
# Now supports additional parameters
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
  --tasks mmstar
```
Usage Examples
Audio Evaluation with Caching
```bash
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench \
  --batch_size 8 \
  --output_path ./audio_results/ \
  --log_samples

# The second run will use the cache - much faster!
```
Multi-Benchmark Evaluation
```bash
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks voicebench_mmsu,csbench,scibench_math,charxiv \
  --batch_size 4 \
  --output_path ./multimodal_results/
```
Distributed Evaluation with Caching
```bash
export LMMS_EVAL_USE_CACHE=True

torchrun --nproc_per_node=8 -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
  --tasks step2_audio_paralinguistic,csbench,scibench \
  --batch_size 16 \
  --output_path ./distributed_results/
```
Programmatic API with Caching
```python
import os

from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

# Enable caching
os.environ["LMMS_EVAL_USE_CACHE"] = "True"

model = AsyncOpenAICompatibleChat(
    model_version="gpt-4o-audio-preview-2024-12-17",
    base_url="https://api.openai.com/v1",
)

results = simple_evaluate(
    model=model,
    tasks=["voicebench", "step2_audio_paralinguistic"],
    batch_size=8,
    device="cuda",
)

print(f"Results: {results['results']}")
```
Technical Details
Caching Architecture
Design Philosophy:
- Simplicity: JSONL format for easy inspection and debugging
- Distributed-safe: Per-rank files avoid write contention
- Transparent: No code changes needed for models using the API
Cache Key: `(task_name, doc_id)`
- Stable across runs if task and document IDs don't change
- The model hash is derived from `model_version` and the task list
File Structure:
```
~/.cache/lmms-eval/eval_cache/
└── <model_hash>/
    ├── task1_rank0_world_size1.jsonl
    ├── task1_rank1_world_size1.jsonl
    └── task2_rank0_world_size1.jsonl
```
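The path layout can be illustrated with a hypothetical helper; the actual hashing scheme used by lmms-eval may differ, this only shows how the pieces fit together:

```python
import hashlib
from pathlib import Path

def cache_path(model_version: str, tasks: list[str], task_name: str,
               rank: int, world_size: int) -> Path:
    # Hypothetical hash derivation: the real scheme in lmms-eval may differ.
    model_hash = hashlib.sha256(
        f"{model_version}:{','.join(sorted(tasks))}".encode()
    ).hexdigest()[:16]
    root = Path.home() / ".cache" / "lmms-eval" / "eval_cache" / model_hash
    return root / f"{task_name}_rank{rank}_world_size{world_size}.jsonl"

print(cache_path("gpt-4o-2024-11-20", ["mmmu_val"], "mmmu_val", rank=0, world_size=1))
```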
Performance:
- Initial run: Full model inference
- Cached run: ~100x faster (I/O bound only)
- Distributed: Linear scaling with cache hits
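One rough way to observe the effect is to time two consecutive runs with caching enabled, using the programmatic API shown in the Usage Examples section; this assumes the second call reloads the cache (as in the `generate_until` sketch earlier), and exact numbers depend on the tasks and backend:

```python
import os
import time

from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

os.environ["LMMS_EVAL_USE_CACHE"] = "True"
model = AsyncOpenAICompatibleChat(model_version="gpt-4o-2024-11-20")

for label in ("initial", "cached"):
    start = time.perf_counter()
    simple_evaluate(model=model, tasks=["mmmu_val"], batch_size=1)
    print(f"{label} run: {time.perf_counter() - start:.1f}s")
```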
Audio Processing Pipeline
Data Flow:
- Load HuggingFace audio datasets
- Convert to unified message format with audio URLs
- Process through audio-capable models
- Apply task-specific metrics (WER, accuracy, GPT-4 judge)
- Aggregate across task groups
Message Format:
```json
{
  "role": "user",
  "content": [
    {"type": "audio", "url": "path/to/audio.wav"},
    {"type": "text", "text": "Question about the audio"}
  ]
}
```
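For API-backed audio models, a message like the one above ultimately has to be mapped to the provider's request format. A rough sketch for an OpenAI-style audio-capable endpoint is shown below; the exact payload fields may differ per provider:

```python
import base64

def to_openai_content(parts: list[dict]) -> list[dict]:
    """Map unified message parts to OpenAI-style chat content parts."""
    converted = []
    for part in parts:
        if part["type"] == "audio":
            with open(part["url"], "rb") as f:
                audio_b64 = base64.b64encode(f.read()).decode()
            converted.append(
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}}
            )
        else:
            converted.append({"type": "text", "text": part["text"]})
    return converted
```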
Model Context Protocol
MCP enables models to call external tools during evaluation:
- Custom server implementation
- Tool definition and parsing
- Multi-step reasoning with tool results
- Compatible with OpenAI-style function calling
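For reference, OpenAI-style function calling describes each tool with a JSON schema like the one below; the tool name and parameters here are illustrative:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "crop_region",  # illustrative tool, mirroring the MCP server sketch above
        "description": "Crop a region of the current image and return the new image path.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_path": {"type": "string"},
                "x0": {"type": "integer"}, "y0": {"type": "integer"},
                "x1": {"type": "integer"}, "y1": {"type": "integer"},
            },
            "required": ["image_path", "x0", "y0", "x1", "y1"],
        },
    },
}]

# Passed as tools=tools to a chat.completions.create(...) call; any tool_calls the
# model returns are then parsed and executed by the MCP client.
```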
Migration Guide
From v0.4 to v0.5
No Breaking Changes: v0.5 is fully backward compatible with v0.4.
New Features to Adopt:
- Enable Caching for API Models:
  ```bash
  # Add these environment variables
  export LMMS_EVAL_USE_CACHE=True
  ```
- Use New Audio Models:
  ```bash
  # GPT-4o Audio Preview
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17
  ```
- Leverage New Benchmarks:
  ```bash
  # Add audio, code, and STEM benchmarks
  --tasks step2_audio_paralinguistic,voicebench,csbench,scibench
  ```
- Optimize Async OpenAI Calls:
  ```bash
  # Use additional parameters for better control
  model_args="model_version=gpt-4o,temperature=0.7,max_tokens=2048"
  ```
Updating Existing Workflows
Before (v0.4):
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-08-06 \
  --tasks mmmu_val \
  --batch_size 1
```
After (v0.5 with caching):
```bash
export LMMS_EVAL_USE_CACHE=True

python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-2024-11-20 \
  --tasks mmmu_val,voicebench,csbench \
  --batch_size 8  # higher batch size works well with caching
```
Bug Fixes and Improvements
Fixed Issues
- `--write_out` Flag Deprecated: The `--write_out` flag is now deprecated in favor of `--log_samples`
  ```bash
  # Old (deprecated)
  --write_out
  # New
  --log_samples
  ```
- TypeError in `write_out` with `log_samples`: Fixed crash when using both flags together
- Batch Size in OpenAI Endpoint: Corrected batch size handling for OpenAI-compatible servers
- Gemma-3 Loading: Fixed model loading to use `Gemma3ForConditionalGeneration` correctly
- SRT API Bugfix: Resolved issues in subtitle/caption processing
- CharXiv Improvements: Fixed chart understanding task configurations
- Async OpenAI Caching Order: Corrected cache lookup order to avoid unnecessary API calls
Performance Improvements
- 10-100x speedup on cached evaluations
- Better async handling for API-based models
- Reduced memory usage in distributed settings
- Faster audio dataset loading from HuggingFace
Deprecated Features
Deprecated Flags
- `--write_out`: Use `--log_samples` instead
  ```bash
  # Deprecated
  python -m lmms_eval --write_out
  # Use instead
  python -m lmms_eval --log_samples
  ```
Model Notes
- Models should implement the caching API for best performance
- Legacy simple models continue to work but miss out on caching benefits
- See `lmms_eval.api.model.lmms` for caching integration
Contributing
We welcome contributions to LMMS-Eval! The v0.5 release demonstrates the value of community contributions across models, benchmarks, and infrastructure.
High-Priority Areas for v0.5.x
- Audio Model Integrations: Help add support for more audio-capable models
- Audio Benchmark Implementations: Expand audio evaluation coverage
- Caching Optimizations: Improve cache hit rates and performance
- Documentation: Enhance guides and examples for audio evaluation
- MCP Server Examples: Create reference implementations for tool calling
How to Contribute
- Fork the repository and create a feature branch from `dev/v0d5`
- Follow the development guidelines in `CLAUDE.md`:
  - Use `uv` for package management (never pip)
  - Add type hints and docstrings
  - Run `uv run ruff format .` and `uv run ruff check . --fix`
  - Run `uv run pyright` for type checking
- Test thoroughly:
  - Add tests for new features
  - Verify caching works if implementing a model
  - Test with realistic datasets
- Submit a pull request with a clear description
Adding New Audio Benchmarks
Follow the pattern in existing audio tasks:
```python
# In tasks/your_audio_task/utils.py
def doc_to_messages(doc):
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": doc["audio_path"]},
            {"type": "text", "text": doc["question"]},
        ],
    }]
```
See `lmms_eval/tasks/step2_audio_paralinguistic/` and `lmms_eval/tasks/voicebench/` for examples.
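Audio tasks also typically define how a model's output is scored. Below is a hedged sketch of a simple exact-match result processor, with field names assumed for illustration; the real tasks wire such functions up through their YAML configs and often use richer metrics (WER, GPT judges):

```python
# In tasks/your_audio_task/utils.py (illustrative field names)
def process_results(doc, results):
    """Score a single prediction against the reference answer."""
    prediction = results[0].strip().lower()
    reference = doc["answer"].strip().lower()
    return {"accuracy": float(prediction == reference)}
```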
Adding Caching to Custom Models
Implement the caching API in your model's `generate_until`:
```python
class MyModel(lmms):
    def generate_until(self, requests):
        # Load previously cached responses for this task/rank
        self.load_cache()

        # Separate cached vs. pending requests
        cached, pending = self.get_response_from_cache(requests)
        responses = [c["response"] for c in cached]

        # Run inference only for the pending requests and cache the outputs
        for req in pending:
            response = self.my_inference_logic(req)
            self.add_request_response_to_cache(req, response)
            responses.append(response)

        return responses
```
See `lmms_eval/models/chat/async_openai.py` for a complete example.
Acknowledgments
The v0.5 release was made possible by contributions from the LMMS-Eval community:
Core Contributors
- Audio Evaluation Suite: Implementation of Step2 Audio Paralinguistic, VoiceBench, and WenetSpeech benchmarks
- Caching Infrastructure: Design and implementation of the JSONL caching system
- Model Integrations: Support for GPT-4o Audio Preview, Gemma-3, LLaVA-OneVision 1.5, LongViLA-R1, and Thyme
- Benchmark Additions: CSBench, SciBench, Lemonade, and CharXiv implementations
- MCP Integration: Model Context Protocol client and tool calling support
- Bug Fixes: Numerous fixes to async OpenAI, batch handling, and model loading
Special Thanks
- Community members who reported issues and provided feedback
- Contributors who improved documentation and examples
- Researchers who shared benchmark datasets and evaluation protocols
Getting Help
Documentation
- Main README: `README.md` - Quick start and overview
- Model Guide: `docs/model_guide.md` - Adding new models
- Task Guide: `docs/task_guide.md` - Implementing new benchmarks
- Caching Guide: `docs/caching.md` - Detailed caching documentation
- Commands: `docs/commands.md` - CLI reference
Support Channels
- GitHub Issues: Report bugs or request features at lmms-eval/issues
- GitHub Discussions: Ask questions and share ideas at lmms-eval/discussions
- Documentation: Check the `docs/` directory for implementation guides
Common Questions
Q: How do I enable caching?
```bash
export LMMS_EVAL_USE_CACHE=True
```
Q: Where are cache files stored?
In `~/.cache/lmms-eval/eval_cache/<model_hash>/`

Q: How do I evaluate audio models?
```bash
python -m lmms_eval \
  --model async_openai \
  --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
  --tasks step2_audio_paralinguistic,voicebench
```
Q: Can I use caching with distributed evaluation?
Yes! Caching works seamlessly with multi-GPU/multi-node evaluation. Each rank maintains its own cache file.
Q: What's the difference between `--write_out` and `--log_samples`?
`--write_out` is deprecated. Use `--log_samples` to save individual sample results.
Version: 0.5.0
Release Date: October 2025
Previous Version: v0.4 Release Notes