v0.2.0
Key Features
- Improved Function Calling (Tools) parsing accuracy, making tool-call extraction robust against unstable LLM output formats
- Added model caching, which eliminates reload time when the same model is used across multiple requests
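Both features surface through the standard OpenAI-compatible chat-completions request. Below is a minimal sketch of a tools request body; the `get_weather` tool, its parameters, and the model id are illustrative assumptions, not part of this release:

```python
import json

# Illustrative tool definition in the OpenAI function-calling schema.
# The tool name and parameters here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request body for the chat-completions endpoint. Reusing the same
# "model" value across requests lets the new model cache skip the
# reload on every call after the first.
body = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model id
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}

print(json.dumps(body, indent=2))
```

The improved parser is what turns the model's raw tool-call text back into a structured response on the server side; the client-side request shape is unchanged.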
Function Calling benchmark results on the madroid/glaive-function-calling-openai dataset:
For the Llama 3.2 3B 4-bit model:
- Accuracy improved from 2.9% to 99.6%
- Average latency reduced from 10.81s to 4.24s
For the Qwen2.5 3B 4-bit model:
- Accuracy improved from 48.4% to 99.0%
- Average latency reduced from 13.22s to 4.89s
Performance comparison with Ollama:
- MLX achieves higher throughput at 77.6 tokens/s (TPS) versus Ollama's 57.6
- A 34.7% speed advantage, while also generating more tokens
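The 34.7% figure follows directly from the two measured throughput numbers:

```python
# Relative speed advantage of MLX over Ollama from the measured TPS values.
mlx_tps = 77.6
ollama_tps = 57.6

advantage_pct = (mlx_tps / ollama_tps - 1) * 100
print(f"{advantage_pct:.1f}%")  # → 34.7%
```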
Example: Web Search with Function Calling
Thanks to the significant improvement in function-calling accuracy, you can now perform web searches through a phidata web agent even with a 4-bit quantized 3B model. Here's how it works:
New Features
- Added prefill response support for pre-populating the start of LLM outputs
- Implemented stream_options, enabling token usage statistics in streaming responses
- Added support for configuring custom stop tokens
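All three additions map onto fields of a standard chat-completions request. A minimal sketch of a request body exercising them together — the model id and prefill text are illustrative, and the prefill shown follows the common convention of a trailing assistant message:

```python
import json

body = {
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",  # assumed model id
    "messages": [
        {"role": "user", "content": "List three prime numbers."},
        # Prefill: a trailing assistant message pre-populates the start of
        # the model's output, so generation continues from this text.
        {"role": "assistant", "content": "Sure, here are three primes:"},
    ],
    # Custom stop tokens: generation halts when any of these is produced.
    "stop": ["\n\n"],
    # Token statistics in streaming: with include_usage set, the final
    # stream chunk carries a usage object with token counts.
    "stream": True,
    "stream_options": {"include_usage": True},
}

print(json.dumps(body, indent=2))
```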
Improvements
- Reorganized code structure for better maintainability
- Added more code examples
Full Changelog: v0.1.2...v0.2.0