
Clarification Needed on do_sample and temperature Settings in infinitebench_qa and multi_lexsum #35

@sreewriter

Description

Hi HELMET team,

While reproducing results for infinitebench_qa and multi_lexsum, I observed several inconsistencies and potential sources of confusion related to decoding parameters (do_sample, temperature, top_p).


1. Misleading Warning When do_sample=False

Background

In principle, do_sample=False and do_sample=True with temperature=0.0 should both imply deterministic (greedy) decoding. In practice, evaluations are typically run with do_sample=False.
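To illustrate the equivalence, here is a minimal pure-Python sketch of next-token selection (a toy model, not HELMET or transformers code): greedy decoding never touches temperature, while sampling collapses onto the same argmax as temperature approaches zero.

```python
import math
import random

def pick_token(logits, do_sample, temperature=1.0):
    """Toy next-token selection: greedy argmax, or temperature sampling."""
    if not do_sample:
        # Greedy decoding: temperature never enters the computation at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: as temperature -> 0, the softmax collapses onto the argmax,
    # matching greedy decoding (temperature == 0 itself would divide by zero).
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    threshold = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if threshold < acc:
            return i
    return len(logits) - 1
```

With do_sample=False the temperature argument is dead weight, which is exactly why a "temperature is being ignored" message is technically correct yet confusing.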

However, when do_sample=False is used, a warning indicates that temperature is being ignored.

Example Warning

*(screenshot of the warning)*

This can mislead users into thinking their configuration is incorrect, even though this is the intended behavior for deterministic decoding.
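A simplified sketch of the kind of generation-config validation that produces such warnings (a hypothetical reconstruction of the check, not the actual transformers source):

```python
import warnings

def check_sampling_flags(do_sample, temperature=1.0, top_p=1.0):
    """Toy version of the config check behind the "will be ignored" warning."""
    ignored = []
    if not do_sample:
        # In greedy mode the sampling knobs have no effect, but any
        # non-default value still triggers a warning.
        if temperature != 1.0:
            ignored.append("temperature")
        if top_p != 1.0:
            ignored.append("top_p")
    for name in ignored:
        warnings.warn(f"`{name}` is set but `do_sample=False`; it will be ignored.")
    return ignored
```

Because the check fires on any non-default value, even a deliberately greedy configuration that happens to carry default sampling fields can emit the warning.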

Additional Confusion

When raising TRANSFORMERS_VERBOSITY for additional debugging output, the framework logs:

  • temperature=0.6
  • top_p=0.9

Although these values are ignored in do_sample=False mode, this is not immediately clear, leading users to believe that generation is actually running with temperature=0.6 and top_p=0.9.

Error on Invalid Combination

When users try to run with:

do_sample=True
temperature=0.0
top_p=1.0

they encounter the following error:

*(screenshot of the error)*
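The error is consistent with a strictly-positive temperature guard in the sampling path, since dividing logits by a temperature of 0.0 is undefined. A hedged sketch of such a guard (hypothetical, not the actual implementation):

```python
def validate_temperature(temperature):
    """Sketch of a strictly-positive temperature guard for sampling mode."""
    # Scaling logits by temperature == 0.0 divides by zero, so sampling
    # implementations typically reject non-positive values up front.
    if not temperature > 0.0:
        raise ValueError(
            f"`temperature` (={temperature}) has to be a strictly positive value "
            "when `do_sample=True`; use `do_sample=False` for greedy decoding."
        )
    return temperature
```

So the framework accepts "greedy via do_sample=False" but rejects "greedy via temperature=0.0", which is the asymmetry users stumble over.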

2. Suggested Resolution

Option 1:
Remove temperature and top_p as command-line arguments entirely, since this is an evaluation framework where deterministic decoding is typically required.

Option 2:
Keep them, but:

  • Clearly document that for evaluations, --do_sample=False should be used.
  • Explicitly state that sampling parameters (temperature, top_p) are ignored when do_sample=False.
  • Consider adjusting or removing the warning message to avoid confusion.
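If Option 2 is adopted, the caveat could live in the help text itself; a hypothetical argparse sketch (flag shapes assumed for illustration, not HELMET's actual CLI):

```python
import argparse

def build_parser():
    """Hypothetical evaluation CLI where sampling knobs document their own caveat."""
    parser = argparse.ArgumentParser(description="Evaluation runner (sketch).")
    parser.add_argument("--do_sample", action="store_true",
                        help="Enable sampling. Leave unset for evaluations (greedy decoding).")
    parser.add_argument("--temperature", type=float, default=1.0,
                        help="Sampling temperature; IGNORED unless --do_sample is set.")
    parser.add_argument("--top_p", type=float, default=1.0,
                        help="Nucleus threshold; IGNORED unless --do_sample is set.")
    return parser
```

Surfacing "ignored unless --do_sample is set" in --help would address the documentation half of Option 2 without removing the flags.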

References


Mismatch Between stop_newline and stop_new_line in HELMET Benchmark

While setting up HELMET after a fresh clone and installation, I encountered an error caused by inconsistent parameter naming between the Python code and configuration files.


Observed Behavior

  • The file model_utils.py uses stop_newline in multiple locations.
  • However, the YAML configuration files reference stop_new_line (with an underscore between “new” and “line”).

This mismatch results in configuration parsing errors and prevents the benchmark from running successfully.
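The failure mode can be reproduced in isolation: a keyword parameter named one way cannot absorb a configuration key spelled the other way (toy example, not the actual model_utils.py code):

```python
def init_generation(*, stop_newline=False):
    """Toy stand-in for a model_utils.py function expecting `stop_newline`."""
    return stop_newline

# Key spelled as in the YAML configuration files:
yaml_config = {"stop_new_line": True}
# init_generation(**yaml_config) raises TypeError: unexpected keyword argument
```

Whichever spelling is chosen, the code and the YAML files need to agree on a single key name.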


Steps to Reproduce

  1. Clone HELMET
  2. Install dependencies as documented
  3. Install the correct Flash Attention version
  4. Run infinitebench_qa

Result: Error occurs.

*(screenshot of the error)*

Temporary Fix

Manually replacing all instances of stop_newline with stop_new_line in model_utils.py resolved the issue and allowed the benchmarks to run.
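The manual edit can be scripted; a sketch of the workaround (writes a .bak backup first; the file path is assumed relative to the HELMET checkout):

```python
from pathlib import Path

def apply_stop_newline_fix(path="model_utils.py"):
    """Temporary workaround: rename `stop_newline` to `stop_new_line`
    so the code matches the key used in the YAML configuration files."""
    target = Path(path)
    source = target.read_text()
    Path(f"{target}.bak").write_text(source)  # keep a backup of the original
    target.write_text(source.replace("stop_newline", "stop_new_line"))
```

This is a stopgap; the proper fix is to settle on one spelling in the repository itself.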

