
Reproduce deepscaler #33

Open
merlinarer opened this issue Feb 17, 2025 · 9 comments

@merlinarer

merlinarer commented Feb 17, 2025

[Image]
After approximately 1,000 steps, an interesting shift occurs for our 8K run: response length begins to increase again. However, this leads to diminishing returns—accuracy plateaus and eventually declines. At the same time, the response clipping ratio rises from 4.2% to 6.5%, indicating that more responses are being truncated at the context limit.

Hello, I tried to reproduce deepscaler on my 8*A80G machine without any change to the code, but found a different trend (wandb shown below):

  • in my experiment, response_length/mean begins to increase at step 400
  • response_length/clip_ratio decreases much more slowly and is still decreasing at step 1300
  • the evaluation score shows a steady trend after step 800
    [Image]

This, in my understanding, indicates that 8k is still enough since clip_ratio doesn't increase; maybe the 8k run should continue for more steps?

@michaelzhiluo
Contributor

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it (see the sketch after this list). Fixing this could possibly help performance ;)
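
For reference, here is a minimal sketch of what that sign issue looks like, assuming a k3-style low-variance KL estimator along the lines of verl's low_var_kl option; this is an illustration, not the exact deepscaler/verl code:

import torch

def low_var_kl(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    # k3-style per-token estimate of KL(pi || pi_ref): exp(x) - x - 1 >= 0,
    # so the estimate is always non-negative.
    log_ratio = ref_logprob - logprob
    return torch.exp(log_ratio) - log_ratio - 1.0

def actor_loss(pg_loss, logprob, ref_logprob, kl_coef=0.001):
    kl = low_var_kl(logprob, ref_logprob).mean()
    # Correct sign: ADD the penalty so the optimizer minimizes KL.
    # The buggy variant effectively subtracted it, i.e. pushed KL up.
    return pg_loss + kl_coef * kl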

@kaiyliu

kaiyliu commented Feb 17, 2025

@michaelzhiluo
Thank you for your work.

What GPU are you using? And how should I adjust the parameters if I only have 8 GPUs with 40G? I found that simply changing data.train_batch_size from 256 to 128 causes OOM, while it works fine with data.train_batch_size=16. Why is that?

@michaelzhiluo
Contributor

You should set the train batch size to be much larger; it will only OOM if your micro batch size is too large. Train batch size shouldn't matter.

I recommend a batch size of at least 128-256 and 8-16 samples per problem. That way you get meaningful gradients rather than the super-noisy ones that are a common problem in RL.
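
To make the distinction concrete, here is a back-of-envelope sketch of how these knobs interact under gradient accumulation. The names mirror the launch script later in this thread; the exact units (prompts vs. responses, global vs. per-GPU) have shifted across verl versions, so treat it as an illustration rather than verl's exact accounting:

train_batch_size     = 128  # prompts sampled per training step
rollout_n            = 8    # responses generated per prompt -> 1024 rollouts per step
ppo_mini_batch_size  = 64   # experience per optimizer update
ppo_micro_batch_size = 2    # sequences per forward/backward chunk

# Each mini batch is split into micro batches and gradients are accumulated,
# so only one micro batch is resident in GPU memory at a time.
num_accum_chunks = ppo_mini_batch_size // ppo_micro_batch_size   # 32 chunks

# Peak activation memory tracks the micro batch (and the sequence length),
# not train_batch_size -- which is why raising train_batch_size from 16 to
# 128 alone should not OOM, while raising the micro batch size (or the max
# response length) will.
tokens_per_chunk = ppo_micro_batch_size * (1024 + 8192)          # ~18k tokens
print(num_accum_chunks, tokens_per_chunk)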

@kaiyliu

kaiyliu commented Feb 17, 2025

You should set the train batch size to be much larger; it will only OOM if your micro batch size is too large. Train batch size shouldn't matter.

I recommend a batch size of at least 128-256 and 8-16 samples per problem. That way you get meaningful gradients rather than the super-noisy ones that are a common problem in RL.

#!/bin/bash
set -x

# Warning: Export VLLM_ATTENTION_BACKEND on every machine before starting Ray cluster.
# vLLM without XFORMERS will result in CUDA errors.
export VLLM_ATTENTION_BACKEND=XFORMERS

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL_PATH="$2"
            shift 2
            ;;
        *)
            break
            ;;
    esac
done

# Set default model path if not provided
if [ -z "$MODEL_PATH" ]; then
    MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
fi

# Train over a single node, 8 A100-80GB GPUs.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/deepscaler/data/train.parquet \
    data.val_files=$HOME/deepscaler/data/aime.parquet \
    data.train_batch_size=128 \
    data.val_batch_size=256 \
    data.max_prompt_length=1024 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$MODEL_PATH  \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=2 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.temperature=0.6 \
    actor_rollout_ref.rollout.val_temperature=0.6 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.n_val=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='deepscaler' \
    trainer.experiment_name=$EXPERIMENT_NAME \
    +trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=20 \
    trainer.default_hdfs_dir=null \
    trainer.total_epochs=30 "${@:1}"

I set all *_micro_batch_size values to 2, and it still OOMs.
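
One thing worth checking against this exact config: a rough token-budget sketch, under the assumption that use_dynamic_bsz=True makes verl pack sequences up to ppo_max_token_len_per_gpu per forward pass, in which case the fixed *_micro_batch_size values may have little effect:

max_prompt_length         = 1024
max_response_length       = 8192
ppo_max_token_len_per_gpu = 32768

max_seq_len = max_prompt_length + max_response_length              # 9216 tokens
seqs_per_packed_batch = ppo_max_token_len_per_gpu // max_seq_len   # 3 full-length sequences
print(max_seq_len, seqs_per_packed_batch)

# If that assumption holds, the knobs most likely to relieve OOM on 40GB GPUs
# are ppo_max_token_len_per_gpu (e.g. 16384), gpu_memory_utilization, and
# tensor_model_parallel_size -- not the *_micro_batch_size values alone.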

@Doris2022

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it. Fixing this could possibly help performance ;)

Question: Have you fixed the bug yet?
Could you please show the code where I can fix the KL loss?
Thanks a lot!!

@michaelzhiluo
Contributor

@michaelzhiluo Thank you for your work.

What GPU are you using? And how should I adjust the parameters if I only have 8 GPUs with 40G? I found that simply changing data.train_batch_size from 256 to 128 causes OOM, while it works fine with data.train_batch_size=16. Why is that?

A100-80GB GPUs, which seem to be the minimum standard these days (as everyone is using H100s now...). Train batch size shouldn't affect OOM; it is most likely the mini batch size or the micro batch size. Try narrowing that down, e.g. set the micro batch size to 1.

@michaelzhiluo
Contributor

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it. Fixing this could possibly help performance ;)

Question: Have you fixed the bug yet? Could you please show the code where I can fix the KL loss? Thanks a lot!!

This was already fixed. We fixed it after realizing the 8k run had such a high KL loss; verl also discovered the same bug here: volcengine/verl@a65c915

@HCHCXY

HCHCXY commented Feb 23, 2025

Can you please release the 8K wandb log? I see the same phenomenon as @merlinarer.

@JingyangXiang

Can you please release the 8K wandb log? I see the same phenomenon as @merlinarer.

My training log is similar to @merlinarer's, too. I use 8xA100 80G for my experiments.
