v0.2.0
Key Features
- Improved Function Calling (Tools) parsing accuracy, making tool-call extraction robust against unstable LLM output formats
- Added model caching, which eliminates reload time when the same model is used across multiple requests
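Both features surface through the standard OpenAI-compatible chat-completions request. Below is a minimal sketch of a tools request body; the `get_weather` tool, its parameters, and the model id are illustrative assumptions, not part of this release:

```python
import json

# Illustrative tool definition in the OpenAI function-calling schema.
# The tool name and parameters here are hypothetical examples.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Request body for the chat-completions endpoint. Reusing the same
# "model" value across requests lets the new model cache skip the
# reload on every call after the first.
body = {
    "model": "mlx-community/Llama-3.2-3B-Instruct-4bit",  # assumed model id
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
}

print(json.dumps(body, indent=2))
```

The improved parser is what turns the model's raw tool-call text back into a structured response on the server side; the client-side request shape is unchanged.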
Function Calling benchmark results on the madroid/glaive-function-calling-openai dataset:
For the Llama 3.2 3B 4-bit model:
- Accuracy improved from 2.9% to 99.6%
- Average latency reduced from 10.81s to 4.24s
For the Qwen2.5 3B 4-bit model:
- Accuracy improved from 48.4% to 99.0%
- Average latency reduced from 13.22s to 4.89s
Performance comparison with Ollama:
- MLX achieves higher throughput at 77.6 tokens/s (TPS) versus Ollama's 57.6
- A 34.7% speed advantage, while also generating more tokens
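The 34.7% figure follows directly from the two measured throughput numbers:

```python
# Relative speed advantage of MLX over Ollama from the measured TPS values.
mlx_tps = 77.6
ollama_tps = 57.6

advantage_pct = (mlx_tps / ollama_tps - 1) * 100
print(f"{advantage_pct:.1f}%")  # → 34.7%
```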
Example: Web Search with Function Calling
Thanks to the significant improvement in function-calling accuracy, you can now perform web searches through a phidata web agent even with a 4-bit quantized 3B model. Here's how it works:
New Features
- Added prefill response support for pre-populating the start of LLM outputs
- Implemented stream_options, enabling token usage statistics in streaming responses
- Added support for configuring custom stop tokens
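All three additions map onto fields of a standard chat-completions request. A minimal sketch of a request body exercising them together — the model id and prefill text are illustrative, and the prefill shown follows the common convention of a trailing assistant message:

```python
import json

body = {
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",  # assumed model id
    "messages": [
        {"role": "user", "content": "List three prime numbers."},
        # Prefill: a trailing assistant message pre-populates the start of
        # the model's output, so generation continues from this text.
        {"role": "assistant", "content": "Sure, here are three primes:"},
    ],
    # Custom stop tokens: generation halts when any of these is produced.
    "stop": ["\n\n"],
    # Token statistics in streaming: with include_usage set, the final
    # stream chunk carries a usage object with token counts.
    "stream": True,
    "stream_options": {"include_usage": True},
}

print(json.dumps(body, indent=2))
```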
Improvements
- Reorganized code structure for better maintainability
- Added more code examples
Full Changelog: v0.1.2...v0.2.0