Add Script to Evaluate Khoj on Google's FRAMES benchmark #955

debanjum · 2024-11-02T12:07:13Z

Why

We need better, automated evals to measure performance shifts of Khoj
across prompt, model and capability changes.

Google's FRAMES benchmark evaluates multi-step retrieval and reasoning
capabilities of AI agents. It's a good starter benchmark to evaluate Khoj.

Details

This PR adds a script that evaluates Khoj responses to the FRAMES
benchmark prompts against the ground truth the benchmark provides.

Gemini is used as an LLM Judge to auto grade Khoj responses vs ground truth
data from the benchmark.
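A minimal sketch of the grading loop described above, not the script added in this PR. The names `grade`, `evaluate`, `ask_agent`, `ask_judge`, the judge prompt wording, and the `prompt`/`answer` field names are all assumptions made for illustration. The agent and judge are passed in as plain callables so no Khoj or Gemini API is assumed.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("frames_eval_sketch")

# Hypothetical judge prompt; the wording used by the real script may differ.
JUDGE_PROMPT = (
    "Question: {prompt}\n"
    "Ground truth: {truth}\n"
    "Agent response: {response}\n"
    "Does the response match the ground truth? Answer TRUE or FALSE."
)


def grade(
    item: Dict[str, str],
    ask_agent: Callable[[str], str],
    ask_judge: Callable[[str], str],
) -> bool:
    """Get the agent's answer for one prompt and ask the judge to grade it."""
    response = ask_agent(item["prompt"])
    verdict = ask_judge(
        JUDGE_PROMPT.format(
            prompt=item["prompt"], truth=item["answer"], response=response
        )
    )
    # Treat any verdict starting with TRUE (case-insensitive) as correct.
    return verdict.strip().upper().startswith("TRUE")


def evaluate(
    items: List[Dict[str, str]],
    ask_agent: Callable[[str], str],
    ask_judge: Callable[[str], str],
    batch_size: int = 4,
) -> float:
    """Grade items batch by batch, running each batch in parallel threads."""
    correct = 0
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results = list(
                pool.map(lambda it: grade(it, ask_agent, ask_judge), batch)
            )
        correct += sum(results)
        logger.info(
            "Graded %d/%d prompts, %d correct so far",
            start + len(batch), len(items), correct,
        )
    # Accuracy is the fraction of responses the judge marked correct.
    return correct / len(items) if items else 0.0
```

To use it, pass a function that sends a prompt to Khoj as `ask_agent` and one that sends the judge prompt to the grading model as `ask_judge`. Threads fit here because both calls are network-bound, so batches mostly wait on I/O.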


debanjum added 3 commits November 2, 2024 04:57

Run prompt batches in parallel for faster eval runs

791eb20

Use logger instead of print to track eval

1ccbf72

debanjum force-pushed the add-script-to-eval-khoj-on-frames-benchmark branch from b1de0f2 to 1ccbf72 Compare November 4, 2024 08:40

debanjum changed the title ~~Add script to evaluate khoj on Google's FRAMES benchmark~~ Add Script to Evaluate Khoj on Google's FRAMES benchmark Nov 4, 2024
