Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync Math-verify #535

Merged
merged 6 commits into from
Feb 5, 2025
Merged

Sync Math-verify #535

merged 6 commits into from
Feb 5, 2025

Conversation

hynky1999
Copy link
Collaborator

@hynky1999 hynky1999 commented Feb 5, 2025

I have made several changes to math-verify to improve recall on correct answers, this should sync it with lighteval.
There are no more changes planned for it right now, so should be stable from now on

@HuggingFaceDocBuilderDev
Copy link
Collaborator

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hynky1999 hynky1999 requested a review from clefourrier February 5, 2025 09:29
Copy link
Member

@NathanHB NathanHB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good ! Unglodly amount of regex logic lmao but guess there is no way out ^^'

@@ -265,12 +269,19 @@ def sample_level_fn(golds: list[str], predictions: list[str], formatted_doc: Doc
# We have to use timeout because the sypmy to str conversion can be very slow
try:
add_to_specifics_with_timeout(formatted_doc, extracted_predictions, extracted_golds)
except: # noqa: E722
except Exception: # noqa: E722
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why adding exception without using it ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some exceptions that don't inherit from Exception (e.g., KeyboardInterrupt). This shouldn't be caught, as we want the program to stop when the user sends it.

@@ -864,7 +858,8 @@ def test_math_extraction_edge_cases(gold, pred, expected):
r"Since $AP:PB = 1:4,$ we can write \[\frac{\overrightarrow{A} - \overrightarrow{P}}{1} = \frac{\overrightarrow{B} - \overrightarrow{P}}{4}.\]Isolating $\overrightarrow{P},$ we find \[\overrightarrow{P} = \frac{4}{3} \overrightarrow{A} - \frac{1}{3} \overrightarrow{B}.\]Thus, $(t,u) = \boxed{\left( \frac{4}{3}, -\frac{1}{3} \right)}.$",
1,
),
(r"$(3,1)$", r"${1,3}$", 1),
# Shouldn't work as it's ordered tuple vs set
# (r"$(3,1)$", r"${1,3}$", 1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it commented ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first version of math-verify didn't make a distinction between tuples and finite sets. The new version makes a distinction between them. Now it's very hard to know what was meant to be a set and what wasn't, so we use gold, which can be control the behavior.

Because the gold here is now an ordered tuple and pred is a set with incorrect ordering the result is false. If the gold was either set {3,1} or the pred was {1,3} (we assume the user meant tuple) it would be true.

tests/metrics/test_extractive_match.py Show resolved Hide resolved
],
)
def test_math_numina_cases(gold, pred, expected):
assert compare_strings(gold, pred, match_types=["latex", "expr"]) == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they all expect 1, would it be interesting to add cases where it should fail?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are some with 0, to check that I didn't just do return True hhh

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this was mostly to ensure that the new supported stuff works

@hynky1999 hynky1999 merged commit cb35bea into main Feb 5, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants