Extremely Long Execution Time per Step #29
In the H200 environment, 30 steps took 3 hours. Even if we reduce the total to 5,000 steps, that comes out to 500 hours, meaning the 8K training alone would require about 20 days (the batch size is unchanged from the sh file in this repository, train_batch_size=128). To rule out an environment issue, I also ran the experiment on an A100 x 8 setup matching the environment you described. The result was the same: 394 seconds per step, no different from the H100 or H200 environments. However, I have one doubt: did stages 1 to 3 really finish in 5 days? That would be only 120 hours in total. I consider this a remarkable achievement and applaud that you have published the code on GitHub and received international recognition for the results, but I would like to verify whether the information provided is accurate. For example, perhaps the number of data points is not 40,000.

===
(main_task pid=15578) step:1 - batch/solve_none:46.000 - batch/solve_all:28.000 - global_seqlen/min:676416.000 - global_seqlen/max:881572.000 - global_seqlen/minmax_diff:205156.000 - global_seqlen/balanced_min:763329.000 - global_seqlen/balanced_max:763330.000 - global_seqlen/mean:763329.500 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.288 - actor/pg_loss:0.118 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.458 - mfu/actor:0.449 - actor/lr:0.000 - critic/score/mean:0.430 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.430 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.057 - critic/advantages/max:2.475 - critic/advantages/min:-2.475 - critic/returns/mean:-0.057 - critic/returns/max:2.475 - critic/returns/min:-2.475 - response_length/mean:5861.707 - response_length/max:8192.000 - response_length/min:217.000 - response_length/clip_ratio:0.442 - prompt_length/mean:101.805 - prompt_length/max:325.000 - prompt_length/min:32.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:261.734 - timing_s/old_log_prob:27.263 - timing_s/ref:24.031 - timing_s/adv:53.904 - timing_s/update_actor:78.578 - timing_s/step:394.366 - timing_per_token_ms/update_actor:0.013 - timing_per_token_ms/adv:0.009 - timing_per_token_ms/ref:0.004 - timing_per_token_ms/gen:0.044
====
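For reference, the timing fields in the log line above already show where the ~394 s goes: most of it is rollout generation rather than the policy update. Below is a minimal sketch of that breakdown; the field names are taken from the log, and the `" - "`-separated `key:value` format is an assumption based on how the line is printed.

```python
# Minimal sketch: break down the ~394 s step time using the metric format above.
line = (
    "step:1 - timing_s/gen:261.734 - timing_s/old_log_prob:27.263 - "
    "timing_s/ref:24.031 - timing_s/adv:53.904 - timing_s/update_actor:78.578 - "
    "timing_s/step:394.366"
)

metrics = {}
for field in line.split(" - "):
    key, _, value = field.partition(":")
    metrics[key] = float(value)

step = metrics["timing_s/step"]
for key, value in metrics.items():
    if key.startswith("timing_s/") and key != "timing_s/step":
        print(f"{key:28s} {value:8.1f} s  ({value / step:5.1%} of step)")
# Generation alone (timing_s/gen) accounts for roughly two thirds of the step,
# so rollout speed, not the optimizer update, dominates the per-step time.
```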
Can you post your script?
I did not create any custom scripts; I ran the provided settings as-is:

# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Run 8K context length training
export MODEL_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

The dataset creation script and installation were also done exactly as described. When the 8K run launched successfully: since I had reduced the data by half for testing, it came to about 6,000 steps. If each step takes 400 seconds, the total training time would be 666 hours.
Thanks. That does not make each step faster, though. By the way, it says that the 8K training is done on a single node; is that correct? Or does that mean that even for 8K training, multi-node with 32 GPUs was used?
How long was the average time per step when you ran it? Even if it was done with 32 GPUs, by my calculation it should have taken one quarter of 25 days, meaning the 8K stage alone should have taken about 6 days, shouldn't it?
Oh sorry, I forgot that you mentioned stage 1. Yes, stage 1 only uses one node.
Thanks, @lambda7xx-san.
|
Hi, Stage 1 takes ~400 seconds per step for us as well on 8 A100 GPUs, which comes to roughly 120 hours of training for 1,000 steps, i.e. about 5 days. In our blog post, we said: "In total, the training took around 3,800 A100 hours, equivalent to roughly 5 days on 32 A100s and $4500 in terms of compute cost." The 5-day total training time is calculated assuming we use 32 A100 GPUs for the entire training. If you normalize the 8K run to 32 A100 GPUs, then the first stage takes ~5/4 = 1.25 days.
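As a back-of-the-envelope check of that normalization (a sketch using only the figures quoted in this thread: ~400 s/step, ~1,000 steps for stage 1, an 8-GPU node, normalized to 32 GPUs; these are not values read from the training config):

```python
# Back-of-the-envelope check of the stage-1 timing quoted above.
sec_per_step = 400           # ~400 s/step reported for stage 1
steps_stage1 = 1000          # ~1,000 steps for the 8K stage
gpus_stage1 = 8              # stage 1 runs on a single 8-GPU node
gpus_total = 32              # normalization target used in the blog post

wall_clock_h = sec_per_step * steps_stage1 / 3600         # ~111 h on one node
gpu_hours = wall_clock_h * gpus_stage1                     # ~890 A100-hours
days_on_32 = wall_clock_h / (gpus_total / gpus_stage1) / 24

print(f"wall clock on 8 GPUs : {wall_clock_h:6.1f} h (~{wall_clock_h / 24:.1f} days)")
print(f"A100 GPU-hours       : {gpu_hours:6.0f}")
print(f"normalized to 32 GPUs: {days_on_32:6.2f} days")
# ~111 h of wall clock, ~890 A100-hours, and ~1.2 days when normalized to
# 32 GPUs, which is consistent with the "~5/4 = 1.25 days" figure above.
```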
@jeffreysijuntan Let me clarify my understanding. Based on the shell script in this GitHub repository, the number of epochs, the data generation script, and the information in your blog, the setup should be 40,000 samples × 30 epochs. With a batch size of 128, as stated in the shell script, this should come to around 10,000 steps in total. However, did you only run 1,000 steps? Additionally, if the training was stopped at 1,000 steps, then the actual run would differ from the published code; a max-step parameter would be needed to limit the steps. I genuinely want to verify the facts and confirm the reproducibility of the results. Could you provide clarification? Best
The dataset is not reduced, so you can think of it as training for roughly 3 epochs. We set the number of epochs to 30 in the script but stopped the run manually once it reached the checkpoint we wanted. If you don't want to stop the run manually, I suggest changing the epoch setting in the script so that the run ends after roughly 3 epochs.
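To make the epoch/step relationship explicit, here is a small sketch using the numbers quoted in this thread (40,000 samples, train_batch_size=128, checkpoint 1040); these figures come from the discussion above rather than from the released config itself:

```python
# Convert between epochs and global steps for the figures quoted above.
num_samples = 40_000         # ~40k prompts, per the blog and data generation script
train_batch_size = 128       # from the released shell script
checkpoint_step = 1040       # checkpoint used to continue to the next stage

steps_per_epoch = num_samples / train_batch_size          # ~312.5
steps_for_30_epochs = steps_per_epoch * 30                 # ~9,375 (the "~10,000 steps")
epochs_at_checkpoint = checkpoint_step / steps_per_epoch   # ~3.3 epochs

print(f"steps per epoch        : {steps_per_epoch:8.1f}")
print(f"steps for 30 epochs    : {steps_for_30_epochs:8.0f}")
print(f"epochs at checkpoint   : {epochs_at_checkpoint:8.2f}")
# Stopping at checkpoint 1040 therefore corresponds to roughly 3 epochs,
# even though the script is configured for 30.
```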
Great, thank you! That makes things much clearer. So, even though the shell script specifies 30 epochs, you actually used checkpoint 1040, meaning you proceeded to the next stage from a checkpoint at roughly epoch 3. After that, you ran the 16K and 32K context experiments on 32 GPUs. I will now verify the reproducibility based on this information. Thanks again! I also now understand how the entire process could be completed in 5 days on 32 GPUs.
Thank you for your outstanding work.
I would like to ask about something. Based on your report, I estimated that you used approximately 3,600 A100 GPU-hours, which translates to around $4,500 in cost. It also appears that you completed training in 5 days using 8 GPUs for Stage 1 and 32 GPUs (4 nodes × 8 GPUs) for Stages 2 and 3.
However, in my environment, I am using 8× H200 GPUs and have also reduced the dataset size by half to speed up the process. Despite this, each step is taking 396 seconds. Based on this, Stage 1 alone would take approximately 22 days to complete.
(Total global steps = 5,040, ~390 sec/step)
If I were to train on a full dataset of 40,000 samples, it would take approximately 90 days.
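As a quick sanity check on the stage-1 estimate (a sketch using only the numbers quoted above: ~390 s/step and 5,040 global steps on the halved dataset):

```python
# Project stage-1 wall-clock time from the per-step timing quoted above.
sec_per_step = 390           # ~390-396 s/step observed on 8x H200 (and 8x A100)
total_steps_half = 5040      # total global steps with the dataset cut in half

days_half_dataset = sec_per_step * total_steps_half / 3600 / 24
print(f"projected stage-1 duration: {days_half_dataset:.1f} days")  # ~22.8 days
# The projection scales linearly with the number of steps, so a larger dataset
# (more steps per epoch, or more epochs) pushes the total out correspondingly.
```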
I am wondering if there are any specific configurations that could be affecting this performance. If there is an expectation that the step time will significantly decrease in the later stages, I would consider continuing the experiment. However, if a 1.5B model takes 22 days just for Stage 1, the discrepancy from your reported numbers is quite large, making it difficult to adopt this approach.
I understand you must be very busy, but I would greatly appreciate any insights or information you can provide.
Thank you in advance.
Log
(WorkerDict pid=87124) kwargs: {'n': 8, 'logprobs': 1, 'max_tokens': 8192, 'detokenize': False, 'temperature': 0.6, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 7x across cluster]
(main_task pid=85914) "Initial validation metrics: {'val/test_score/': 0.2}"
(main_task pid=85914) step:0 - val/test_score/:0.200
(WorkerDict pid=86835) MICROBATCH STEP
(WorkerDict pid=87121) /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
(WorkerDict pid=87121)   with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
(WorkerDict pid=87124) /usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 7x across cluster]
(WorkerDict pid=87124) warnings.warn( [repeated 7x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 24x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(WorkerDict pid=86835) MICROBATCH STEP [repeated 32x across cluster]
(main_task pid=85914) step:1 - batch/solve_none:85.000 - batch/solve_all:11.000 - global_seqlen/min:652897.000 - global_seqlen/max:813828.000 - global_seqlen/minmax_diff:160931.000 - global_seqlen/balanced_min:727119.000 - global_seqlen/balanced_max:727120.000 - global_seqlen/mean:727119.875 - actor/kl_loss:0.000 - actor/kl_coef:0.001 - actor/entropy_loss:0.291 - actor/pg_loss:0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.358 - mfu/actor:1.841 - actor/lr:0.000 - critic/score/mean:0.228 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.228 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.040 - critic/advantages/max:2.475 - critic/advantages/min:-2.475 - critic/returns/mean:-0.040 - critic/returns/max:2.475 - critic/returns/min:-2.475 - response_length/mean:5587.194 - response_length/max:8192.000 - response_length/min:183.000 - response_length/clip_ratio:0.446 - prompt_length/mean:93.430 - prompt_length/max:244.000 - prompt_length/min:31.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:323.768 - timing_s/old_log_prob:12.304 - timing_s/ref:12.797 - timing_s/adv:32.468 - timing_s/update_actor:39.688 - timing_s/step:396.208 - timing_per_token_ms/ref:0.002 - timing_per_token_ms/adv:0.006 - timing_per_token_ms/gen:0.057 - timing_per_token_ms/update_actor:0.007