
v0.2.0

@madroidmaq madroidmaq released this 16 Dec 15:26

Key Features

  • Enhanced Function Calling (Tools) parsing accuracy to mitigate LLM output instability issues
  • Added model caching support to eliminate reload time when using the same model multiple times
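The improved parsing produces standard OpenAI-style `tool_calls`, so client code can dispatch them in the usual way. A minimal sketch of that dispatch step, assuming the OpenAI-compatible response shape; `get_weather`, its schema, and the example call are illustrative, not part of the release:

```python
import json

# Illustrative local function the model may call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# OpenAI-style tool schema sent with the chat completion request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch_tool_call(tool_call: dict) -> str:
    """Parse a tool call returned by the model and run the matching function."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    registry = {"get_weather": get_weather}
    return registry[name](**args)

# Shape of a single entry in choices[0].message.tool_calls:
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}
print(dispatch_tool_call(example_call))  # Sunny in Paris
```

The accuracy gains below are exactly about this step: the model must emit `arguments` as valid JSON matching the declared schema for `json.loads` and the dispatch to succeed.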

Function Calling test results on the madroid/glaive-function-calling-openai dataset:

For Llama3.2 3B 4bit model:

  • Accuracy improved from 2.9% to 99.6%
  • Average latency reduced from 10.81s to 4.24s

[Chart: Llama3.2 3B 4bit accuracy and average latency, before vs. after]

For Qwen2.5 3B 4bit model:

  • Accuracy improved from 48.4% to 99.0%
  • Average latency reduced from 13.22s to 4.89s

[Chart: Qwen2.5 3B 4bit accuracy and average latency, before vs. after]

Performance comparison with Ollama:

  • MLX achieves a higher throughput (77.6 tokens/s) than Ollama (57.6 tokens/s)
  • That is a 34.7% speed advantage, while also generating more tokens
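The 34.7% figure follows directly from the two throughput numbers above:

```python
# Recomputing the claimed speed advantage from the reported throughput.
mlx_tps = 77.6     # tokens/sec reported for MLX
ollama_tps = 57.6  # tokens/sec reported for Ollama

advantage = (mlx_tps - ollama_tps) / ollama_tps
print(f"{advantage:.1%}")  # 34.7%
```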

Example: Web Search with Function Calling
Thanks to the significant improvement in function calling accuracy, you can now perform web searches with a phidata web agent even on a 4-bit quantized 3B model. Here's how it works:
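Under the hood, such an agent runs a standard tool-use loop over the chat API. A sketch of that loop with the network round-trips simulated; the `web_search` tool, its arguments, and the fake results are illustrative stand-ins, not part of phidata or this release:

```python
import json

# Stand-in for a real web search tool (e.g. a search API wrapper).
def fake_web_search(query: str) -> str:
    return json.dumps([{"title": "MLX 0.2.0 release", "url": "https://example.com"}])

messages = [{"role": "user", "content": "Search the web for MLX news"}]

# 1. The model answers with a tool call instead of text (simulated here):
tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "web_search", "arguments": '{"query": "MLX news"}'},
}
messages.append({"role": "assistant", "content": None, "tool_calls": [tool_call]})

# 2. The agent executes the tool and feeds the result back as a tool message:
args = json.loads(tool_call["function"]["arguments"])
messages.append({
    "role": "tool",
    "tool_call_id": tool_call["id"],
    "content": fake_web_search(**args),
})

# 3. The conversation is now user -> assistant(tool_calls) -> tool;
#    a second completion request produces the final, grounded answer.
print([m["role"] for m in messages])  # ['user', 'assistant', 'tool']
```

Step 1 is where the parsing improvements matter: with the old 2.9% accuracy on Llama3.2 3B 4bit, this loop almost never got past the first tool call.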

Implementation:
[Screenshot: phidata web agent implementation code]

Result:
[Screenshot: web search answer produced by the agent]

New Features

  • Added prefill response support for pre-populating LLM outputs
  • Implemented stream_options to report token usage statistics in streaming responses
  • Added support for custom stop tokens configuration
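The three options above can be combined in a single chat completion request. A request sketch, assuming prefill works by ending the message list with a partial assistant message (a common convention) and that the server follows the OpenAI `stream_options` and `stop` parameter shapes; the model name and prefill text are illustrative:

```python
# Request body sketch combining the v0.2.0 options (not sent anywhere here).
request = {
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [
        {"role": "user", "content": "List three MLX features as JSON."},
        # Prefill: a trailing assistant message pre-populates the output,
        # so generation continues from this prefix.
        {"role": "assistant", "content": '{"features": ['},
    ],
    "stream": True,
    # Include token usage statistics in the final stream chunk.
    "stream_options": {"include_usage": True},
    # Custom stop tokens: generation halts when any of these is produced.
    "stop": ["]}"],
}
print(request["messages"][-1]["role"])  # assistant
```

Prefilling with an opening JSON fragment and stopping on the closing one is a handy way to force well-formed structured output from a small quantized model.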

Improvements

  • Reorganized code structure for better maintainability
  • Added more code examples

Full Changelog: v0.1.2...v0.2.0