Sync Math-verify #535

Merged 6 commits on Feb 5, 2025
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -109,7 +109,7 @@ multilingual = [
"jieba", # for chinese tokenizer
"pyvi", # for vietnamese tokenizer
]
-math = ["latex2sympy2_extended>=0.9.3"]
+math = ["latex2sympy2_extended==1.0.4"]

[project.urls]
Homepage = "https://github.com/huggingface/lighteval"
19 changes: 15 additions & 4 deletions src/lighteval/metrics/dynamic_metrics.py
@@ -193,6 +193,7 @@ def multilingual_extractive_match_metric(
fallback_mode: Literal["no_fallback", "first_match"] = "first_match",
extraction_mode: Literal["first_match", "any_match"] = "any_match",
precision: int = 6,
+timeout_seconds: int = 5,
) -> SampleLevelMetric:
"""Creates a language-aware extractive match metric that extracts answers from the model's output.

@@ -222,6 +223,8 @@ def multilingual_extractive_match_metric(

precision: int
Number of decimal places to use when comparing numerical values. Defaults to 6.
+timeout_seconds: int
+Timeout for the extraction (each attempt) and comparison. Defaults to 5.

Returns:
A sample level metric that extracts and compares mathematical expressions.
@@ -245,11 +248,12 @@ def sample_level_fn(golds: list[str], predictions: list[str], formatted_doc: Doc
pred_extraction_regexes = get_extraction_regexes(formatted_doc, pred_extraction_target, language)

extracted_predictions = [
-extract_target_from_pred(pred, pred_extraction_regexes, fallback_mode, extraction_mode)
+extract_target_from_pred(pred, pred_extraction_regexes, fallback_mode, extraction_mode, timeout_seconds)
for pred in predictions
]
extracted_golds = [
-extract_target_from_pred(gold, gold_extraction_regexes, fallback_mode, extraction_mode) for gold in golds
+extract_target_from_pred(gold, gold_extraction_regexes, fallback_mode, extraction_mode, timeout_seconds)
+for gold in golds
]

# Assert on empty gold and warn on empty pred
@@ -265,12 +269,19 @@ def sample_level_fn(golds: list[str], predictions: list[str], formatted_doc: Doc
# We have to use timeout because the sympy to str conversion can be very slow
try:
add_to_specifics_with_timeout(formatted_doc, extracted_predictions, extracted_golds)
-except: # noqa: E722
+except Exception: # noqa: E722
Member

Why add `Exception` without using it?

Collaborator Author

Some exceptions don't inherit from `Exception` (e.g., `KeyboardInterrupt`). Those shouldn't be caught, since we want the program to stop when the user sends an interrupt.
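
To illustrate the point about the exception hierarchy: `KeyboardInterrupt` derives from `BaseException` rather than `Exception`, so `except Exception:` lets it propagate while a bare `except:` swallows it. A minimal sketch (the function names are hypothetical and not part of this PR):

```python
def swallow_with_bare_except():
    try:
        raise KeyboardInterrupt
    except:  # noqa: E722 - a bare except also catches BaseException subclasses
        return "caught"


def swallow_with_except_exception():
    try:
        raise KeyboardInterrupt
    except Exception:
        # Not reached: KeyboardInterrupt is not a subclass of Exception.
        return "caught"


bare_result = swallow_with_bare_except()

try:
    narrow_result = swallow_with_except_exception()
except KeyboardInterrupt:
    # The interrupt escaped the narrower handler, which is the desired behavior.
    narrow_result = "propagated"
```
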

logger.warning("Timeout when adding extracted predictions and golds to specific")

return aggregation_function(
[
-(1.0 if any(compare_gold_target(gold, pred, precision) for gold in extracted_golds) else 0.0)
+(
+1.0
+if any(
+compare_gold_target(gold, pred, precision, timeout_seconds=timeout_seconds)
+for gold in extracted_golds
+)
+else 0.0
+)
for pred in extracted_predictions
]
)
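
The diff threads `timeout_seconds` through extraction and comparison, but it does not show how the timeout itself is enforced. As a rough sketch of the general idea only, assuming a thread-based approach: `run_with_timeout` and `slow_compare` below are hypothetical names and do not come from lighteval or Math-Verify, whose actual mechanism may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def run_with_timeout(fn, timeout_seconds, default, *args, **kwargs):
    """Hypothetical helper: return fn(*args, **kwargs), or `default` on timeout."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeoutError:
        return default
    finally:
        # Do not block on the worker: a timed-out call keeps running in the
        # background, since a Python thread cannot be forcibly stopped.
        executor.shutdown(wait=False)


def slow_compare(gold, pred):
    time.sleep(0.5)  # stand-in for an expensive symbolic comparison
    return gold == pred


# A fast comparison finishes within the budget and returns its real result.
fast_result = run_with_timeout(lambda g, p: g == p, 1, False, "1/2", "1/2")

# A slow comparison exceeds the budget, so the default (no match) is returned.
slow_result = run_with_timeout(slow_compare, 0.05, False, "1/2", "1/2")
```

Treating a timed-out comparison as "no match" keeps a single pathological expression from stalling the whole evaluation, at the cost of occasionally scoring a correct answer as wrong.
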