
Clarification Needed on do_sample and temperature Settings in infinitebench_qa and multi_lexsum #35

@sreewriter

Description

Hi HELMET team,

While reproducing results for infinitebench_qa and multi_lexsum, I observed several inconsistencies and potential sources of confusion related to decoding parameters (do_sample, temperature, top_p).


1. Misleading Warning When do_sample=False

Background

In principle, do_sample=False and do_sample=True with temperature=0.0 should both imply deterministic (greedy) decoding. In practice, evaluations are typically run with do_sample=False.
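To illustrate the equivalence, here is a minimal pure-Python sketch of next-token selection (a toy model, not HELMET or transformers code): greedy decoding never touches temperature, while sampling collapses onto the same argmax as temperature approaches zero.

```python
import math
import random

def pick_token(logits, do_sample, temperature=1.0):
    """Toy next-token selection: greedy argmax, or temperature sampling."""
    if not do_sample:
        # Greedy decoding: temperature never enters the computation at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling: as temperature -> 0, the softmax collapses onto the argmax,
    # matching greedy decoding (temperature == 0 itself would divide by zero).
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    threshold = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if threshold < acc:
            return i
    return len(logits) - 1
```

With do_sample=False the temperature argument is dead weight, which is exactly why a "temperature is being ignored" message is technically correct yet confusing.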

However, when do_sample=False is used, a warning indicates that temperature is being ignored.

Example Warning

*(screenshot of the warning)*

This can mislead users into thinking their configuration is incorrect, even though this is the intended behavior for deterministic decoding.
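A simplified sketch of the kind of generation-config validation that produces such warnings (a hypothetical reconstruction of the check, not the actual transformers source):

```python
import warnings

def check_sampling_flags(do_sample, temperature=1.0, top_p=1.0):
    """Toy version of the config check behind the "will be ignored" warning."""
    ignored = []
    if not do_sample:
        # In greedy mode the sampling knobs have no effect, but any
        # non-default value still triggers a warning.
        if temperature != 1.0:
            ignored.append("temperature")
        if top_p != 1.0:
            ignored.append("top_p")
    for name in ignored:
        warnings.warn(f"`{name}` is set but `do_sample=False`; it will be ignored.")
    return ignored
```

Because the check fires on any non-default value, even a deliberately greedy configuration that happens to carry default sampling fields can emit the warning.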

Additional Confusion

When raising TRANSFORMERS_VERBOSITY for additional debugging output, the framework logs:

  • temperature=0.6
  • top_p=0.9

Although these values are ignored in do_sample=False mode, this is not immediately clear, leading users to believe that generation is actually running with temperature=0.6 and top_p=0.9.

Error on Invalid Combination

When users try to run with:

do_sample=True
temperature=0.0
top_p=1.0

they encounter the following error:

*(screenshot of the error)*
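The error is consistent with a strictly-positive temperature guard in the sampling path, since dividing logits by a temperature of 0.0 is undefined. A hedged sketch of such a guard (hypothetical, not the actual implementation):

```python
def validate_temperature(temperature):
    """Sketch of a strictly-positive temperature guard for sampling mode."""
    # Scaling logits by temperature == 0.0 divides by zero, so sampling
    # implementations typically reject non-positive values up front.
    if not temperature > 0.0:
        raise ValueError(
            f"`temperature` (={temperature}) has to be a strictly positive value "
            "when `do_sample=True`; use `do_sample=False` for greedy decoding."
        )
    return temperature
```

So the framework accepts "greedy via do_sample=False" but rejects "greedy via temperature=0.0", which is the asymmetry users stumble over.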

2. Suggested Resolution

Option 1:
Remove temperature and top_p as command-line arguments entirely, since this is an evaluation framework where deterministic decoding is typically required.

Option 2:
Keep them, but:

  • Clearly document that for evaluations, --do_sample=False should be used.
  • Explicitly state that sampling parameters (temperature, top_p) are ignored when do_sample=False.
  • Consider adjusting or removing the warning message to avoid confusion.
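If Option 2 is adopted, the caveat could live in the help text itself; a hypothetical argparse sketch (flag shapes assumed for illustration, not HELMET's actual CLI):

```python
import argparse

def build_parser():
    """Hypothetical evaluation CLI where sampling knobs document their own caveat."""
    parser = argparse.ArgumentParser(description="Evaluation runner (sketch).")
    parser.add_argument("--do_sample", action="store_true",
                        help="Enable sampling. Leave unset for evaluations (greedy decoding).")
    parser.add_argument("--temperature", type=float, default=1.0,
                        help="Sampling temperature; IGNORED unless --do_sample is set.")
    parser.add_argument("--top_p", type=float, default=1.0,
                        help="Nucleus threshold; IGNORED unless --do_sample is set.")
    return parser
```

Surfacing "ignored unless --do_sample is set" in --help would address the documentation half of Option 2 without removing the flags.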

References


Mismatch Between stop_newline and stop_new_line in HELMET Benchmark

While setting up HELMET after a fresh clone and installation, I encountered an error caused by inconsistent parameter naming between the Python code and configuration files.


Observed Behavior

  • The file model_utils.py uses stop_newline in multiple locations.
  • However, the YAML configuration files reference stop_new_line (with an underscore between “new” and “line”).

This mismatch results in configuration parsing errors and prevents the benchmark from running successfully.
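The failure mode can be reproduced in isolation: a keyword parameter named one way cannot absorb a configuration key spelled the other way (toy example, not the actual model_utils.py code):

```python
def init_generation(*, stop_newline=False):
    """Toy stand-in for a model_utils.py function expecting `stop_newline`."""
    return stop_newline

# Key spelled as in the YAML configuration files:
yaml_config = {"stop_new_line": True}
# init_generation(**yaml_config) raises TypeError: unexpected keyword argument
```

Whichever spelling is chosen, the code and the YAML files need to agree on a single key name.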


Steps to Reproduce

  1. Clone HELMET
  2. Install dependencies as documented
  3. Install the correct Flash Attention version
  4. Run infinitebench_qa

Result: Error occurs.

*(screenshot of the error)*

Temporary Fix

Manually replacing all instances of stop_newline with stop_new_line in model_utils.py resolved the issue and allowed the benchmarks to run.
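The manual edit can be scripted; a sketch of the workaround (writes a .bak backup first; the file path is assumed relative to the HELMET checkout):

```python
from pathlib import Path

def apply_stop_newline_fix(path="model_utils.py"):
    """Temporary workaround: rename `stop_newline` to `stop_new_line`
    so the code matches the key used in the YAML configuration files."""
    target = Path(path)
    source = target.read_text()
    Path(f"{target}.bak").write_text(source)  # keep a backup of the original
    target.write_text(source.replace("stop_newline", "stop_new_line"))
```

This is a stopgap; the proper fix is to settle on one spelling in the repository itself.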

