
Extremely Long Execution Time per Step #29

Open
hodachi-axcxept opened this issue Feb 15, 2025 · 12 comments

@hodachi-axcxept commented Feb 15, 2025

Thank you for your outstanding work.

I would like to ask about something. Based on your report, I estimated that you used approximately 3,600 A100 GPU-hours, which translates to around $4,500 in compute cost. It also appears that you completed training in 5 days, using 8 GPUs for Stage 1 and 32 GPUs (4 nodes × 8 GPUs) for Stages 2 and 3.

However, in my environment I am using 8× H200 GPUs and have also reduced the dataset size by half to speed up the process. Despite this, each step is taking 396 seconds, so Stage 1 alone would take approximately 22 days to complete (5,040 total global steps at ~390 s/step).
If I were to train on the full dataset of 40,000 samples, it would take approximately 90 days.
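For transparency, the conversion behind the 22-day figure is just total steps times seconds per step; a minimal sketch with my numbers:

```python
# Rough conversion of the observed step time into wall-clock days
# (5,040 total global steps and ~390 s/step are from my halved-dataset Stage 1 run).
total_steps = 5040
sec_per_step = 390
days = total_steps * sec_per_step / 86400  # 86,400 seconds per day
print(f"Stage 1 alone: ~{days:.1f} days")  # ≈ 22.8 days
```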

I am wondering if there are any specific configurations that could be affecting this performance. If there is an expectation that the step time will significantly decrease in the later stages, I would consider continuing the experiment. However, if a 1.5B model takes 22 days just for Stage 1, the discrepancy from your reported numbers is quite large, making it difficult to adopt this approach.

I understand you must be very busy, but I would greatly appreciate any insights or information you can provide.

Thank you in advance.

Log
(WorkerDict pid=87124) kwargs: {'n': 8, 'logprobs': 1, 'max_tokens': 8192, 'detokenize': False, 'temperature': 0.6, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 7x across cluster]
(main_task pid=85914) "Initial validation metrics: {'val/test_score/': 0.2}"
(main_task pid=85914) step:0 - val/test_score/:0.200
(WorkerDict pid=86835) MICROBATCH STEP
(WorkerDict pid=87121) /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
(WorkerDict pid=87121) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
(WorkerDict pid=87124) /usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 7x across cluster]
(WorkerDict pid=87124) warnings.warn( [repeated 7x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 24x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(main_task pid=85914) step:1 - batch/solve_none:85.000 - batch/solve_all:11.000 - global_seqlen/min:652897.000 - global_seqlen/max:813828.000 - global_seqlen/minmax_diff:160931.000 - global_seqlen/balanced_min:727119.000 - global_seqlen/balanced_max:727120.000 - global_seqlen/mean:727119.875 - actor/kl_loss:0.000 - actor/kl_coef:0.001 - actor/entropy_loss:0.291 - actor/pg_loss:0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.358 - mfu/actor:1.841 - actor/lr:0.000 - critic/score/mean:0.228 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.228 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.040 - critic/advantages/max:2.475 - critic/advantages/min:-2.475 - critic/returns/mean:-0.040 - critic/returns/max:2.475 - critic/returns/min:-2.475 - response_length/mean:5587.194 - response_length/max:8192.000 - response_length/min:183.000 - response_length/clip_ratio:0.446 - prompt_length/mean:93.430 - prompt_length/max:244.000 - prompt_length/min:31.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:323.768 - timing_s/old_log_prob:12.304 - timing_s/ref:12.797 - timing_s/adv:32.468 - timing_s/update_actor:39.688 - timing_s/step:396.208 - timing_per_token_ms/ref:0.002 - timing_per_token_ms/adv:0.006 - timing_per_token_ms/gen:0.057 - timing_per_token_ms/update_actor:0.007

@hodachi-axcxept (Author)

In the H200 environment, it took 3 hours for 30 steps. Even if we reduce the total to 5,000 steps, that comes out to 500 hours – meaning that even for 8K training alone, it would require about 20 days (note that the batch size remains as provided in the sh file from GitHub, “train_batch_size=128”).

Taking into account the possibility of an environmental issue, I tried running experiments in an A100×8 setup under the environment you described. The result was the same – 394 seconds per step. This is no different from the H100 or H200 environments.

However, I have one doubt: did Stages 1 to 3 really finish in 5 days? That is only 120 hours in total.

While I consider it a remarkable achievement and applaud that you’ve published the code on GitHub and received international recognition for your results, I would like to verify whether the information provided is truly accurate. For example:

• Perhaps the number of data points is not 40,000.
• Maybe during 8K training, the setup wasn't a single node with 8 GPUs.
• The batch size might have been different.
• Or some environmental issue could have led to your per-step timing being much faster.

I want to know the truth. If it really did take that long, that is impressive in its own right, but if the cost turns out to be significantly different when someone tries to reproduce your results, please let me know.

===
A100×8 Environment① (Vast AI)

(main_task pid=15578) step:1 - batch/solve_none:46.000 - batch/solve_all:28.000 - global_seqlen/min:676416.000 - global_seqlen/max:881572.000 - global_seqlen/minmax_diff:205156.000 - global_seqlen/balanced_min:763329.000 - global_seqlen/balanced_max:763330.000 - global_seqlen/mean:763329.500 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.288 - actor/pg_loss:0.118 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.458 - mfu/actor:0.449 - actor/lr:0.000 - critic/score/mean:0.430 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.430 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.057 - critic/advantages/max:2.475 - critic/advantages/min:-2.475 - critic/returns/mean:-0.057 - critic/returns/max:2.475 - critic/returns/min:-2.475 - response_length/mean:5861.707 - response_length/max:8192.000 - response_length/min:217.000 - response_length/clip_ratio:0.442 - prompt_length/mean:101.805 - prompt_length/max:325.000 - prompt_length/min:32.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:261.734 - timing_s/old_log_prob:27.263 - timing_s/ref:24.031 - timing_s/adv:53.904 - timing_s/update_actor:78.578 - timing_s/step:394.366 - timing_per_token_ms/update_actor:0.013 - timing_per_token_ms/adv:0.009 - timing_per_token_ms/ref:0.004 - timing_per_token_ms/gen:0.044
(WorkerDict pid=16133) MICROBATCH STEP [repeated 8x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=16133) MICROBATCH STEP [repeated 16x across cluster]

====
A100×8 Environment② (AWS SageMaker Jupyter Notebook)
(main_task pid=36230) step:1 - batch/solve_none:38.000 - batch/solve_all:33.000 - global_seqlen/min:598507.000 - global_seqlen/max:806827.000 - global_seqlen/minmax_diff:208320.000 - global_seqlen/balanced_min:720932.000 - global_seqlen/balanced_max:720933.000 - global_seqlen/mean:720932.125 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.261 - actor/pg_loss:0.070 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.447 - mfu/actor:0.446 - actor/lr:0.000 - critic/score/mean:0.500 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.500 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.062 - critic/advantages/max:2.475 - critic/advantages/min:-2.475 - critic/returns/mean:-0.062 - critic/returns/max:2.475 - critic/returns/min:-2.475 - response_length/mean:5527.103 - response_length/max:8192.000 - response_length/min:328.000 - response_length/clip_ratio:0.381 - prompt_length/mean:105.180 - prompt_length/max:395.000 - prompt_length/min:33.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:250.058 - timing_s/old_log_prob:21.567 - timing_s/ref:20.905 - timing_s/adv:44.706 - timing_s/update_actor:74.117 - timing_s/step:369.191 - timing_per_token_ms/update_actor:0.013 - timing_per_token_ms/ref:0.004 - timing_per_token_ms/adv:0.008 - timing_per_token_ms/gen:0.044
(WorkerDict pid=37047) MICROBATCH STEP [repeated 8x across cluster]
(WorkerDict pid=37323) /home/ec2-user/anaconda3/envs/myenv/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead. [repeated 8x across cluster]
(WorkerDict pid=37323) with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] [repeated 8x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 16x across cluster]
(WorkerDict pid=37047) MICROBATCH STEP [repeated 8x across cluster]

@lambda7xx

Can you post your script?

@hodachi-axcxept (Author)

I did not create any custom scripts.
I simply followed the instructions provided in this repository.

# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS

# Run 8K context length training
export MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
./scripts/train/run_deepscaler_1.5b_8k.sh --model $MODEL_PATH

The dataset creation script and installation were also done exactly as described.

When the process was successful, for 8K:
• 30 epochs × 50,000 samples = 1,500,000 samples
• With a batch size of 128, that results in 11,718 steps.

Since I reduced the data by half for testing, it was about 6,000 steps.

If each step takes 400 seconds, the total training time would be 666 hours.
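For reference, this is the back-of-envelope arithmetic I am using (a rough sketch only; the epoch count, sample count, and train_batch_size are taken from the repository's shell and data scripts, and the ~400 s/step from my logs):

```python
# Back-of-envelope step count and wall-clock estimate for the 8K run.
epochs, samples, batch_size = 30, 50_000, 128
sec_per_step = 400  # observed in my runs

full_steps = epochs * samples // batch_size          # ≈ 11,718 steps on the full data
half_steps = epochs * (samples // 2) // batch_size   # ≈ 5,859 steps with the data halved
half_hours = half_steps * sec_per_step / 3600        # ≈ 651 hours, i.e. roughly 27 days

print(full_steps, half_steps, round(half_hours))
```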

@lambda7xx commented Feb 15, 2025

"Reduced the data by half for testing": do you mean you reduced the 50,000 samples to 25,000? BTW, we use 32× A100 GPUs, while you use 8× H200 GPUs.

@hodachi-axcxept (Author)

Thanks, but that doesn't change anything. I tried with 20,000 samples, but that only reduced the total number of steps to about 5,000; it does not make each individual step any faster.

By the way, it says that the 8K training is done on a single node. Is that correct, or does it mean that even the 8K training used multi-node with 32 GPUs?

[Image]

@hodachi-axcxept (Author)

How long was the average time per step when you ran it?
How many hours did the 8K training take to complete?

Even if it was done with 32 GPUs, by my calculation it should have taken about one-fourth of 25 days, so the 8K stage alone should still have taken roughly 6 days, shouldn't it?

@lambda7xx

> Thanks, but that doesn't change anything. I tried with 20,000 samples, but that only reduced the total number of steps to about 5,000; it does not make each individual step any faster.
>
> By the way, it says that the 8K training is done on a single node. Is that correct, or does it mean that even the 8K training used multi-node with 32 GPUs?
>
> [Image]

Oh sorry, I forgot you were asking about Stage 1. Yes, Stage 1 only uses 1 node.

@hodachi-axcxept (Author)

> Oh sorry, I forgot you were asking about Stage 1. Yes, Stage 1 only uses 1 node.

Thanks, @lambda7xx-san. Could you also answer the questions below?

How long was the average time per step when you ran it?
How many hours did the 8K training take to complete?

Even if it was done with 32 GPUs, by my calculation it should have taken about one-fourth of 25 days, so the 8K stage alone should still have taken roughly 6 days, shouldn't it?

I just want to be clear on this.

@jeffreysijuntan (Contributor)

Hi,

First, we do not run for 5,000 steps. In our blog post, we mention that we only train for 1,000 steps for our 8K run.

Stage 1 takes ~400 seconds per step for us as well on 8 A100 GPUs, which comes out to roughly 120 hours of training for 1,000 steps, i.e., about 5 days.

In our blog post, we said: "In total, the training took around 3,800 A100 hours, equivalent to roughly 5 days on 32 A100s and $4500 in terms of compute cost." That 5-day total training time is calculated assuming we use 32 A100 GPUs for the entire training. If you normalize the 8K run to 32 A100 GPUs, then the first stage would take about 5/4 ≈ 1.25 days.
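For clarity, the normalization works out roughly like this (a sketch of the arithmetic only, using the step count and per-step time above and the 3,800 A100-hour total from the blog post):

```python
# Stage 1 (8K) on 8 A100s: ~400 s/step for ~1,000 steps.
stage1_hours_8gpu = 1000 * 400 / 3600            # ≈ 111 h, i.e. roughly 5 days of wall-clock time

# With 4x the GPUs (32 A100s), wall-clock time drops to roughly a quarter.
stage1_days_32gpu = stage1_hours_8gpu / 24 / 4   # ≈ 1.2 days

# Total compute from the blog post, expressed as wall-clock days on 32 A100s.
total_days_32gpu = 3800 / 32 / 24                # ≈ 4.9 days

print(round(stage1_hours_8gpu), round(stage1_days_32gpu, 2), round(total_days_32gpu, 1))
```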

@hodachi-axcxept (Author)

@jeffreysijuntan, thanks for your reply.

Let me clarify my understanding.

Based on the shell script in this GitHub repository, its number of epochs, the data generation script, and the information from your blog, the setup should be 40,000 samples × 30 epochs.

With a batch size of 128, as stated in the shell script, this should result in around 10,000 steps in total.

However, did you only run 1,000 steps? If so:
• Was the dataset reduced to 4,000 samples?
• Or was the number of epochs reduced to 3?

Please clarify these details.

Additionally, if the training was stopped at 1,000 steps, then the run must differ from the published code, since a max-steps parameter would be needed to limit the number of steps.

I genuinely want to verify the facts and confirm the reproducibility of the results. Could you provide clarification?

best

@jeffreysijuntan (Contributor)

The dataset is not reduced, so you can think of it as training for roughly 3 epochs.

We set num_epochs to a large number so that we can let the run go for a long time and decide, based on the performance trend, when to stop. In our experiments, we take the checkpoint at step 1040 for the subsequent 16K training.

If you don't want to stop the run manually, I suggest changing num_epochs to 3. We also welcome PRs to our repo, so if you want, you can implement max_steps and submit a PR.
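For anyone who does want a hard cap, a minimal sketch of the idea is below. It is illustrative only, not the actual trainer code in this repo; the `max_steps` field, the callables, and the loop structure are assumptions about how such a cap could be wired in.

```python
from dataclasses import dataclass

@dataclass
class TrainerConfig:
    num_epochs: int = 30
    max_steps: int | None = None  # hypothetical cap; None means "run all epochs"

def fit(config, dataloader, train_step, save_checkpoint):
    """Toy training loop showing how an early stop at max_steps could work."""
    global_step = 0
    for _epoch in range(config.num_epochs):
        for batch in dataloader:
            train_step(batch)
            global_step += 1
            if config.max_steps is not None and global_step >= config.max_steps:
                save_checkpoint(global_step)  # stop here and keep this checkpoint
                return global_step
    save_checkpoint(global_step)
    return global_step
```

With `max_steps=1040`, for example, the run would stop right after producing the checkpoint used for the subsequent 16K stage.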

@hodachi-axcxept (Author)

Great, thank you! That makes things much clearer.

So, even though the shell script specifies 30 epochs, you actually used the checkpoint at step 1040, meaning you proceeded to the next stage from a checkpoint at roughly epoch 3.

After that, you ran the 16K and subsequent longer-context experiments using 32 GPUs.

I’ll now verify the reproducibility based on this information.

Thanks again!

I also now understand how the entire process could be completed within 5 days on 32 GPUs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants