
A/B testing #680

Open
boxabirds opened this issue Feb 17, 2025 · 1 comment
Labels: enhancement (New feature or request)

Comments


boxabirds commented Feb 17, 2025

Problem: output varies wildly with both the choice of LLM and the prompt, so small changes can produce significantly different results.

Solution: the ability to specify a list of prompt variations and a list of LLMs to try.

You could use Optuna for efficient evaluation (cf. DSPy), along with Argilla for human evaluation.
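
A minimal sketch of what this could look like with Optuna's categorical search over prompt × model combinations. `run_agent`, `score_output`, and the prompt/model lists are hypothetical placeholders, not part of this project's API:

```python
import optuna

# Hypothetical candidate lists supplied by the user.
PROMPTS = [
    "You are a concise assistant. {task}",
    "Think step by step, then answer. {task}",
]
MODELS = ["gpt-4o-mini", "claude-3-haiku", "llama-3-8b"]

def run_agent(prompt: str, model: str, task: str) -> str:
    # Stand-in: a real implementation would invoke the framework
    # with the chosen prompt template and model.
    return f"[{model}] " + prompt.format(task=task)

def score_output(output: str) -> float:
    # Stand-in metric: replace with a real automatic metric,
    # or human scores collected via Argilla.
    return 1.0 / (1 + len(output))

def objective(trial: optuna.Trial) -> float:
    # Optuna samples one prompt/model combination per trial,
    # focusing later trials on promising regions.
    prompt = trial.suggest_categorical("prompt", PROMPTS)
    model = trial.suggest_categorical("model", MODELS)
    output = run_agent(prompt, model, task="Summarise this ticket.")
    return score_output(output)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)  # best prompt/model combination found
```

The appeal over plain grid search is that Optuna's sampler can prune weak prompt/model pairs early, which matters when every trial costs real LLM calls.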

boxabirds added the enhancement label on Feb 17, 2025
sysradium (Contributor) commented

@boxabirds have you got an API in mind?
