
Evaluation Results Are Zero Due to Format Inconsistency in DeepSeek-R1-Distill-Qwen-1.5B #16

Open
junkangwu opened this issue Feb 12, 2025 · 3 comments

@junkangwu

Thanks for your valuable work.

When I use your framework to reproduce the baseline performance (DeepSeek-R1-Distill-Qwen-1.5B) and evaluate with your scripts:

./scripts/eval/eval_model.sh --model DeepSeek-R1-Distill-Qwen-1.5B --datasets aime math amc minerva olympiad_bench --output-dir DeepSeek-R1-Distill-Qwen-1.5B

All results come out as zero. When I inspected the generated responses, I found that they do contain the ground-truth answers.

Later, I found that the evaluation was returning early at this point in the reward code:

        # Extract solution.
        if THOUGHT_DELIMITER_START in model_response and THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

The responses generated by DeepSeek-R1-Distill-Qwen-1.5B often do not contain THOUGHT_DELIMITER_START but do contain THOUGHT_DELIMITER_END. I therefore relaxed the check to:

        if THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

With this change, the results were successfully reproduced. I suspect that DeepSeek-R1-Distill-Qwen-1.5B itself is simply not reliable at following the delimiter format, which caused this issue. Please consider adopting this adjustment so that subsequent reproductions of this baseline do not run into the same problem.
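To make the difference concrete, here is a minimal standalone sketch of the two checks. The delimiter values `<think>` / `</think>` are my assumption of what the repo's constants expand to; the failing response shape mirrors what I observed, where only the closing tag appears in the generated text:

```python
# Hypothetical delimiter values (assumed, not copied from the repo).
THOUGHT_DELIMITER_START = "<think>"
THOUGHT_DELIMITER_END = "</think>"

def extract_strict(model_response):
    """Original check: requires BOTH delimiters, else format error."""
    if THOUGHT_DELIMITER_START in model_response and THOUGHT_DELIMITER_END in model_response:
        return model_response.split(THOUGHT_DELIMITER_END)[1]
    return None  # scored as format_error_reward -> zero

def extract_relaxed(model_response):
    """Relaxed check: only the closing delimiter is required."""
    if THOUGHT_DELIMITER_END in model_response:
        return model_response.split(THOUGHT_DELIMITER_END)[1]
    return None

# A response where "<think>" never appears in the generated text,
# only the closing tag followed by the final answer:
resp = "reasoning steps ...</think> The answer is 42."
print(extract_strict(resp))   # None -> the whole sample scores zero
print(extract_relaxed(resp))  # " The answer is 42." -> answer is graded
```

With the strict check, every such response is marked a format error regardless of whether the final answer is correct, which explains the all-zero evaluation results.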

@michaelzhiluo
Contributor

I did not observe the same behavior for DeepSeek's distilled model! Did you also observe this behavior for our model, DeepScaleR?

Check the eval logs in the README.md for a deeper dive into the trajectories generated in our eval runs!

@junkangwu
Author

Thank you for your reply. I downloaded DeepScaleR-1.5B-Preview and tested it again; the problem did not occur there. However, DeepSeek-R1-Distill-Qwen-1.5B does produce responses in which THOUGHT_DELIMITER_START (`<think>`) is missing.

@rucnyz

rucnyz commented Feb 15, 2025

I have the same observation. Applying the same relaxed check:

        if THOUGHT_DELIMITER_END in model_response:
            model_solution = model_response.split(THOUGHT_DELIMITER_END)[1]
        else:
            return RewardOutput(reward=self.config.format_error_reward, is_correct=False)

This indeed works!
