metric
We want to be able to assess whether the generated answer is correct, given the information provided in our QA calibration tab.
Correctness should only consider the facts in the "expected answer" column, compared against the generated answer. The judge should take each fact in the expected answer and check whether it is also present in the generated answer.
THIS IS FOR THE DB BOT ONLY; THE NAV BOT WILL HAVE A SEPARATE TICKET.
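
A rough sketch of the check described above, to make the intended behavior concrete. Everything here is an assumption for illustration: the names `judge_correctness`, `CorrectnessResult`, and `naive_fact_check` are hypothetical, the expected answer is assumed to be already split into individual facts, and the substring match is only a stand-in for whatever judge (most likely an LLM call) we end up using.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: (fact, generated_answer) -> True if the fact is present.
FactCheck = Callable[[str, str], bool]


@dataclass
class CorrectnessResult:
    correct: bool             # True only if every expected fact was found
    missing_facts: List[str]  # expected facts not found in the generated answer


def naive_fact_check(fact: str, generated_answer: str) -> bool:
    """Placeholder check: case-insensitive substring match.

    Assumption: the real judge would replace this with an LLM or similar.
    """
    return fact.lower() in generated_answer.lower()


def judge_correctness(
    expected_facts: List[str],
    generated_answer: str,
    fact_check: FactCheck = naive_fact_check,
) -> CorrectnessResult:
    """Check that each fact from the expected answer appears in the generated answer.

    Only the expected facts are examined; anything else in the generated
    answer is not looked at by this check.
    """
    missing = [f for f in expected_facts if not fact_check(f, generated_answer)]
    return CorrectnessResult(correct=not missing, missing_facts=missing)


# Made-up example row, not taken from the QA calibration tab.
example = judge_correctness(
    ["42 rows", "orders table"],
    "The orders table contains 42 rows, last updated yesterday.",
)
```

In this sketch the check runs one way only: it starts from the expected answer and looks for each fact in the generated answer. Whether additional facts in the generated answer should count against it is not decided here.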