-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync Math-verify #535
Sync Math-verify #535
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good ! Unglodly amount of regex logic lmao but guess there is no way out ^^'
@@ -265,12 +269,19 @@ def sample_level_fn(golds: list[str], predictions: list[str], formatted_doc: Doc | |||
# We have to use timeout because the sypmy to str conversion can be very slow | |||
try: | |||
add_to_specifics_with_timeout(formatted_doc, extracted_predictions, extracted_golds) | |||
except: # noqa: E722 | |||
except Exception: # noqa: E722 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding exception without using it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some exceptions that don't inherit from Exception (e.g., KeyboardInterrupt). This shouldn't be caught, as we want the program to stop when the user sends it.
@@ -864,7 +858,8 @@ def test_math_extraction_edge_cases(gold, pred, expected): | |||
r"Since $AP:PB = 1:4,$ we can write \[\frac{\overrightarrow{A} - \overrightarrow{P}}{1} = \frac{\overrightarrow{B} - \overrightarrow{P}}{4}.\]Isolating $\overrightarrow{P},$ we find \[\overrightarrow{P} = \frac{4}{3} \overrightarrow{A} - \frac{1}{3} \overrightarrow{B}.\]Thus, $(t,u) = \boxed{\left( \frac{4}{3}, -\frac{1}{3} \right)}.$", | |||
1, | |||
), | |||
(r"$(3,1)$", r"${1,3}$", 1), | |||
# Shouldn't work as it's ordered tuple vs set | |||
# (r"$(3,1)$", r"${1,3}$", 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it commented ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first version of math-verify didn't make a distinction between tuples and finite sets. The new version makes a distinction between them. Now it's very hard to know what was meant to be a set and what wasn't, so we use gold, which can be control the behavior.
Because the gold here is now an ordered tuple and pred is a set with incorrect ordering the result is false. If the gold was either set {3,1} or the pred was {1,3} (we assume the user meant tuple) it would be true.
], | ||
) | ||
def test_math_numina_cases(gold, pred, expected): | ||
assert compare_strings(gold, pred, match_types=["latex", "expr"]) == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they all expect 1, would it be interesting to add cases where it should fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are some with 0, to check that I didn't just do return True hhh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this was mostly to ensure that the new supported stuff works
I have made several changes to math-verify to improve recall on correct answers, this should sync it with lighteval.
There are no more changes planned for it right now, so should be stable from now on