xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
benchmark regex reliability evaluation llm reliability-tools chatgpt cc-by-nc-nd-4 open-compass llm-as-a-judge deepseek-math judge-model reasoning-models open-r1 xverify math-verify
Updated Apr 17, 2025 - Python