
Reproduce deepscaler #33

Open
merlinarer opened this issue Feb 17, 2025 · 9 comments

@merlinarer

merlinarer commented Feb 17, 2025

[Image]
After approximately 1,000 steps, an interesting shift occurs for our 8K run: response length begins to increase again. However, this leads to diminishing returns—accuracy plateaus and eventually declines. At the same time, the response clipping ratio rises from 4.2% to 6.5%, indicating that more responses are being truncated at the context limit.

Hello, I tried to reproduce deepscaler on my 8*A80G machine without any change to the code, but found a different trend (wandb shown below):

  • in my experiment, response_length/mean begins to increase at step 400
  • response_length/clip_ratio decreases much more slowly and is still decreasing at step 1300
  • the evaluation score shows a steady trend after step 800
    [Image]

This, in my understanding, indicates that 8k is still enough since clip_ratio doesn't increase; maybe the 8k run should continue for more steps?

@michaelzhiluo
Contributor

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it (see the sketch after this list). Fixing this could possibly help performance ;)
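
For reference, here is a minimal sketch of what that sign issue looks like, assuming a k3-style low-variance KL estimator along the lines of verl's low_var_kl option; this is an illustration, not the exact deepscaler/verl code:

import torch

def low_var_kl(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    # k3-style per-token estimate of KL(pi || pi_ref): exp(x) - x - 1 >= 0,
    # so the estimate is always non-negative.
    log_ratio = ref_logprob - logprob
    return torch.exp(log_ratio) - log_ratio - 1.0

def actor_loss(pg_loss, logprob, ref_logprob, kl_coef=0.001):
    kl = low_var_kl(logprob, ref_logprob).mean()
    # Correct sign: ADD the penalty so the optimizer minimizes KL.
    # The buggy variant effectively subtracted it, i.e. pushed KL up.
    return pg_loss + kl_coef * kl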

@kaiyliu

kaiyliu commented Feb 17, 2025

@michaelzhiluo
Thank you for your work.

What GPU are you using? And how should I adjust the parameters if I only have 8 GPUs with 40G? I found that simply changing data.train_batch_size from 256 to 128 causes OOM, while it works fine with data.train_batch_size=16. Why is that?

@michaelzhiluo
Contributor

You should set the train batch size to be much larger; it will only OOM if your micro batch size is too large. Train batch size shouldn't matter.

I recommend a batch size of at least 128-256 and 8-16 samples per problem. That way you get meaningful gradients rather than the super-noisy ones that are a common problem in RL.
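
To make the distinction concrete, here is a back-of-envelope sketch of how these knobs interact under gradient accumulation. The names mirror the launch script later in this thread; the exact units (prompts vs. responses, global vs. per-GPU) have shifted across verl versions, so treat it as an illustration rather than verl's exact accounting:

train_batch_size     = 128  # prompts sampled per training step
rollout_n            = 8    # responses generated per prompt -> 1024 rollouts per step
ppo_mini_batch_size  = 64   # experience per optimizer update
ppo_micro_batch_size = 2    # sequences per forward/backward chunk

# Each mini batch is split into micro batches and gradients are accumulated,
# so only one micro batch is resident in GPU memory at a time.
num_accum_chunks = ppo_mini_batch_size // ppo_micro_batch_size   # 32 chunks

# Peak activation memory tracks the micro batch (and the sequence length),
# not train_batch_size -- which is why raising train_batch_size from 16 to
# 128 alone should not OOM, while raising the micro batch size (or the max
# response length) will.
tokens_per_chunk = ppo_micro_batch_size * (1024 + 8192)          # ~18k tokens
print(num_accum_chunks, tokens_per_chunk)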

@kaiyliu

kaiyliu commented Feb 17, 2025

You should set the train batch size to be much larger; it will only OOM if your micro batch size is too large. Train batch size shouldn't matter.

I recommend a batch size of at least 128-256 and 8-16 samples per problem. That way you get meaningful gradients rather than the super-noisy ones that are a common problem in RL.

#!/bin/bash
set -x

# Warning: Export VLLM_ATTENTION_BACKEND on every machine before starting Ray cluster.
# vLLM without XFORMERS will result in CUDA errors.
export VLLM_ATTENTION_BACKEND=XFORMERS

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL_PATH="$2"
            shift 2
            ;;
        *)
            break
            ;;
    esac
done

# Set default model path if not provided
if [ -z "$MODEL_PATH" ]; then
    MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
fi

# Train over a single node, 8 A100-80GB GPUs.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/deepscaler/data/train.parquet \
    data.val_files=$HOME/deepscaler/data/aime.parquet \
    data.train_batch_size=128 \
    data.val_batch_size=256 \
    data.max_prompt_length=1024 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$MODEL_PATH  \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=2 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=2 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.temperature=0.6 \
    actor_rollout_ref.rollout.val_temperature=0.6 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.n_val=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='deepscaler' \
    trainer.experiment_name=$EXPERIMENT_NAME \
    +trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=20 \
    trainer.default_hdfs_dir=null \
    trainer.total_epochs=30 "${@:1}"

I set all *_micro_batch_size values to 2, and it still OOMs.
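
One thing worth checking against this exact config: a rough token-budget sketch, under the assumption that use_dynamic_bsz=True makes verl pack sequences up to ppo_max_token_len_per_gpu per forward pass, in which case the fixed *_micro_batch_size values may have little effect:

max_prompt_length         = 1024
max_response_length       = 8192
ppo_max_token_len_per_gpu = 32768

max_seq_len = max_prompt_length + max_response_length              # 9216 tokens
seqs_per_packed_batch = ppo_max_token_len_per_gpu // max_seq_len   # 3 full-length sequences
print(max_seq_len, seqs_per_packed_batch)

# If that assumption holds, the knobs most likely to relieve OOM on 40GB GPUs
# are ppo_max_token_len_per_gpu (e.g. 16384), gpu_memory_utilization, and
# tensor_model_parallel_size -- not the *_micro_batch_size values alone.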

@Doris2022

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it. Fixing this could possibly help performance ;)

Question: Have you fixed the bug yet?
Could you please show the code where I can fix the KL loss?
Thanks a lot!!

@michaelzhiluo
Contributor

@michaelzhiluo Thank you for your work.

What GPU are you using? And how should I adjust the parameters if I only have 8 GPUs with 40G? I found that simply changing data.train_batch_size from 256 to 128 causes OOM, while it works fine with data.train_batch_size=16. Why is that?

A100-80GB GPUs, which seem to be the minimum standard these days (as everyone is using H100s now...). Train batch size shouldn't affect OOM; it is most likely the mini batch size or the micro batch size. Try narrowing that down, e.g. set the micro batch size to 1.

@michaelzhiluo
Contributor

  1. Probably run longer; our 8k run got corrupted, but it went to 1600 steps!
  2. Our 8k run had a bug where we were maximizing the KL loss instead of minimizing it. Fixing this could possibly help performance ;)

Question: Have you fixed the bug yet? Could you please show the code where I can fix the KL loss? Thanks a lot!!

This was already fixed. We fixed it after realizing the 8k run had such a high KL loss; verl also discovered the same bug here: volcengine/verl@a65c915

@HCHCXY

HCHCXY commented Feb 23, 2025

Can you please release the 8K wandb log? I see the same phenomenon as @merlinarer.

@JingyangXiang

Can you please release the 8K wandb log? I see the same phenomenon as @merlinarer.

My training log is similar to @merlinarer's, too. I use 8xA100 80G for my experiments.
