
Evaluation Results Are Zero Due to Format Inconsistency in DeepSeek-R1-Distill-Qwen-1.5B #16

Open
junkangwu opened this issue Feb 12, 2025 · 3 comments

@junkangwu

Thanks for your valuable work.

When I use your framework to reproduce the baseline performance (DeepSeek-R1-Distill-Qwen-1.5B) and evaluate with your scripts:

./scripts/eval/eval_model.sh --model DeepSeek-R1-Distill-Qwen-1.5B --datasets aime math amc minerva olympiad_bench --output-dir DeepSeek-R1-Distill-Qwen-1.5B

All results come out as zero. When I inspected the generated responses, I found that they do contain the ground-truth answers.

Later, I found that the evaluation was returning early at this point in the reward code:

        # Extract solution.
        if THOUGHT_DELIMITER_START in model_response and THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

The responses generated by DeepSeek-R1-Distill-Qwen-1.5B often do not contain THOUGHT_DELIMITER_START but do contain THOUGHT_DELIMITER_END. I therefore relaxed the check to:

        if THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

With this change, the results were successfully reproduced. I suspect that DeepSeek-R1-Distill-Qwen-1.5B itself is simply not reliable at following the delimiter format, which caused this issue. Please consider adopting this adjustment so that subsequent reproductions of this baseline do not run into the same problem.
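To make the difference concrete, here is a minimal standalone sketch of the two checks. The delimiter values `<think>` / `</think>` are my assumption of what the repo's constants expand to; the failing response shape mirrors what I observed, where only the closing tag appears in the generated text:

```python
# Hypothetical delimiter values (assumed, not copied from the repo).
THOUGHT_DELIMITER_START = "<think>"
THOUGHT_DELIMITER_END = "</think>"

def extract_strict(model_response):
    """Original check: requires BOTH delimiters, else format error."""
    if THOUGHT_DELIMITER_START in model_response and THOUGHT_DELIMITER_END in model_response:
        return model_response.split(THOUGHT_DELIMITER_END)[1]
    return None  # scored as format_error_reward -> zero

def extract_relaxed(model_response):
    """Relaxed check: only the closing delimiter is required."""
    if THOUGHT_DELIMITER_END in model_response:
        return model_response.split(THOUGHT_DELIMITER_END)[1]
    return None

# A response where "<think>" never appears in the generated text,
# only the closing tag followed by the final answer:
resp = "reasoning steps ...</think> The answer is 42."
print(extract_strict(resp))   # None -> the whole sample scores zero
print(extract_relaxed(resp))  # " The answer is 42." -> answer is graded
```

With the strict check, every such response is marked a format error regardless of whether the final answer is correct, which explains the all-zero evaluation results.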

@michaelzhiluo
Contributor

I did not observe the same behavior for DeepSeek's distilled model! Did you also observe this behavior for our model, DeepScaleR?

Check the eval logs in the README.md for a deeper dive into the trajectories generated in our eval runs!

@junkangwu
Author

Thank you for your reply. I downloaded DeepScaleR-1.5B-Preview and tested it again; the problem did not occur there. However, DeepSeek-R1-Distill-Qwen-1.5B does produce responses in which THOUGHT_DELIMITER_START (`<think>`) is missing.

@rucnyz

rucnyz commented Feb 15, 2025

I have the same observation. Applying the same relaxed check:

        if THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

This indeed works!
