Clarification Needed on do_sample and temperature Settings in infinitebench_qa and multi_lexsum
Hi HELMET team,
While reproducing results for infinitebench_qa and multi_lexsum, I observed several inconsistencies and potential sources of confusion related to the decoding parameters (`do_sample`, `temperature`, `top_p`).
1. Misleading Warning When do_sample=False
Background
Logically, setting `do_sample=False` and setting `do_sample=True` with `temperature=0.0` should both imply deterministic (greedy) decoding. In practice, evaluations are typically run with `do_sample=False`.
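To make the intuition concrete, here is a toy, self-contained sketch (illustration only, not HELMET or transformers code) showing why greedy decoding and near-zero-temperature sampling pick the same token, and why `temperature=0.0` itself is ill-defined when sampling:

```python
import math
import random

def sample_next_token(logits, do_sample=True, temperature=1.0):
    """Toy next-token picker (illustration only, not HELMET code)."""
    if not do_sample:
        # Greedy decoding: always take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax sampling; temperature=0.0 would
    # divide by zero here, which is why sampling rejects it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = random.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1

logits = [1.0, 3.5, 0.2]
# As temperature -> 0, the softmax collapses onto the argmax, so
# near-zero-temperature sampling matches greedy decoding.
assert sample_next_token(logits, do_sample=False) == 1
assert sample_next_token(logits, do_sample=True, temperature=1e-6) == 1
```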
However, when `do_sample=False` is used, a warning indicates that `temperature` is being ignored.
Example Warning
This can mislead users into thinking their configuration is incorrect, even though this is the intended behavior for deterministic decoding.
Additional Confusion
With `TRANSFORMERS_VERBOSITY` raised for debugging, the framework additionally reports:

`temperature=0.6, top_p=0.9`

Although these values are ignored when `do_sample=False`, this is not immediately obvious, so users may believe that `temperature=0.6` and `top_p=0.9` are actually in effect.
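One way to make the message less alarming is to warn only when an ignored sampling flag was explicitly changed from its default, and to state that ignoring it is expected. A sketch of such a check (hypothetical helper, not the framework's actual code):

```python
import warnings

# Hypothetical defaults for illustration; the real defaults live in the
# framework's generation config.
DEFAULTS = {"temperature": 1.0, "top_p": 1.0}

def check_generation_flags(do_sample, temperature, top_p):
    """Warn only when an ignored sampling flag deviates from its default."""
    if do_sample:
        return
    ignored = [f"{name}={value}"
               for name, value in (("temperature", temperature),
                                   ("top_p", top_p))
               if value != DEFAULTS[name]]
    if ignored:
        warnings.warn(
            "do_sample=False selects greedy decoding; "
            + ", ".join(ignored)
            + " will be ignored. This is expected for deterministic evaluation."
        )
```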
Error on Invalid Combination
When users try to run with:

- `do_sample=True`
- `temperature=0.0`
- `top_p=1.0`

they encounter the following error:
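The rejected combination can be mimicked with a minimal validation sketch (hypothetical, not the actual transformers check): sampling with `temperature=0.0` is ill-defined because the logits are divided by the temperature.

```python
def validate_sampling_args(do_sample, temperature, top_p):
    """Sketch of validation that rejects temperature=0.0 when sampling."""
    if do_sample and temperature <= 0.0:
        raise ValueError(
            "temperature must be strictly positive when do_sample=True; "
            "use do_sample=False for deterministic (greedy) decoding."
        )
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in the interval (0, 1].")

# Deterministic decoding is fine regardless of the sampling flags,
# since they are simply ignored.
validate_sampling_args(do_sample=False, temperature=0.0, top_p=1.0)
```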
2. Suggested Resolution
Option 1:
Remove `temperature` and `top_p` as command-line arguments entirely, since this is an evaluation framework where deterministic decoding is typically required.
Option 2:
Keep them, but:
- Clearly document that evaluations should use `--do_sample=False`.
- Explicitly state that the sampling parameters (`temperature`, `top_p`) are ignored when `do_sample=False`.
- Consider adjusting or removing the warning message to avoid confusion.
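Option 2 could look roughly like this with argparse (a sketch with hypothetical flag semantics and help strings, not HELMET's actual CLI):

```python
import argparse

def build_parser():
    """Sketch of an evaluation CLI that documents the sampling flags."""
    parser = argparse.ArgumentParser(description="evaluation runner (sketch)")
    parser.add_argument(
        "--do_sample", action="store_true",
        help="Enable sampling. Leave unset for deterministic (greedy) "
             "decoding, which is what evaluations should normally use.")
    parser.add_argument(
        "--temperature", type=float, default=1.0,
        help="Sampling temperature; IGNORED unless --do_sample is set.")
    parser.add_argument(
        "--top_p", type=float, default=1.0,
        help="Nucleus sampling threshold; IGNORED unless --do_sample is set.")
    return parser

# Passing sampling flags without --do_sample parses fine; they simply
# have no effect, as the help text now states up front.
args = build_parser().parse_args(["--temperature", "0.6", "--top_p", "0.9"])
assert args.do_sample is False
```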
References
Mismatch Between stop_newline and stop_new_line in HELMET Benchmark
While setting up HELMET after a fresh clone and installation, I encountered an error caused by inconsistent parameter naming between the Python code and configuration files.
Observed Behavior
- The file `model_utils.py` uses `stop_newline` in multiple locations.
- However, the YAML configuration files reference `stop_new_line` (with an underscore between “new” and “line”).
This mismatch results in configuration parsing errors and prevents the benchmark from running successfully.
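The failure mode reduces to a minimal sketch (an illustrative dict stands in for the parsed YAML; the key names are the real ones from above):

```python
# The parsed YAML configuration provides "stop_new_line" ...
config_from_yaml = {"stop_new_line": True}

def read_stop_setting(config):
    # ... but a model_utils.py-style lookup (sketch) uses the other spelling.
    return config["stop_newline"]

try:
    read_stop_setting(config_from_yaml)
except KeyError as err:
    print(f"KeyError: {err}")  # the mismatch surfaces as a missing key
```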
Steps to Reproduce
- Clone HELMET
- Install dependencies as documented
- Install the correct Flash Attention version
- Run `infinitebench_qa`
Result: the run fails with a configuration parsing error.
Temporary Fix
Manually replacing all instances of `stop_newline` with `stop_new_line` in `model_utils.py` resolved the issue and allowed the benchmarks to run.
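The workaround can be scripted; a minimal sketch (assumes it is run from the directory containing model_utils.py, and keeps a `.bak` copy before rewriting):

```python
from pathlib import Path

def align_spelling(path="model_utils.py"):
    """Rewrite the file so the code matches the YAML spelling (workaround sketch)."""
    src = Path(path)
    text = src.read_text()
    Path(str(src) + ".bak").write_text(text)  # backup before editing
    # "stop_new_line" does not contain "stop_newline" as a substring,
    # so re-running this on an already-fixed file changes nothing.
    src.write_text(text.replace("stop_newline", "stop_new_line"))
```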