
[BUG] Transformer layers are split incorrectly when the pipeline-parallel size does not evenly divide num-layers #1304

Open
Baibaifan opened this issue Nov 27, 2024

Describe the bug

Situation:

GPT2 model

  • num-layers=30, pipeline-model-parallel-size=4
  • decoder-first-pipeline-num-layers and decoder-last-pipeline-num-layers are not set

Segmentation results

stage1: 0,1,2,3,4,5,6
stage2: 7,8,9,10,11,12,13
stage3: 14,15,16,17,18,19,20
stage4: 21,22,23,24,25,26,27

Sum of layers: 28, which does not match the configured 30 layers (reproduced in the sketch below).
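The two missing layers come from plain floor division over the pipeline stages. The following is a minimal sketch (assumed logic for illustration, not Megatron-LM's actual source) that reproduces the split shown above:

```python
# Sketch of how a naive floor-division split assigns layers to pipeline
# stages and silently drops the remainder when num_layers is not
# divisible by the pipeline-parallel size.
num_layers = 30
pipeline_model_parallel_size = 4

layers_per_stage = num_layers // pipeline_model_parallel_size  # 30 // 4 = 7

stages = [
    list(range(rank * layers_per_stage, (rank + 1) * layers_per_stage))
    for rank in range(pipeline_model_parallel_size)
]

for rank, layers in enumerate(stages):
    print(f"stage{rank + 1}: {layers}")

total = sum(len(layers) for layers in stages)
print(f"sum layers: {total}")  # 28 -- two layers are never built
```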

In the legacy version, there is a check on the total number of model layers.
[Screenshot: layer-count check in the legacy code]

In the Mcore version, only num-layers-per-virtual-pipeline-stage can be used to determine the number of model layers, so this case is not caught.
[Screenshot: Mcore layer-count logic]

If users are expected to split the layers themselves when the count cannot be divided evenly, I think a validation check and an explicit warning should be added here; see the sketch below.
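A rough sketch of the kind of validation being requested is shown below. The helper validate_layer_split() is hypothetical and not an existing Megatron-LM function; its argument names simply mirror the CLI flags mentioned above.

```python
import warnings


def validate_layer_split(num_layers,
                         pipeline_model_parallel_size,
                         first_stage_layers=None,
                         last_stage_layers=None):
    """Fail (or warn) when transformer layers would be silently dropped."""
    if first_stage_layers is None and last_stage_layers is None:
        # Without hand-tuned first/last stage sizes, require an even split.
        if num_layers % pipeline_model_parallel_size != 0:
            raise ValueError(
                f"num_layers ({num_layers}) is not divisible by "
                f"pipeline_model_parallel_size "
                f"({pipeline_model_parallel_size}); either change the sizes "
                "or set --decoder-first-pipeline-num-layers / "
                "--decoder-last-pipeline-num-layers explicitly."
            )
    else:
        # Uneven split requested by the user: remind them to verify the sum.
        warnings.warn(
            "Uneven pipeline split requested; make sure the per-stage "
            "layer counts sum to num_layers."
        )


# Example: the configuration from this report now fails loudly
# instead of silently dropping two layers.
try:
    validate_layer_split(num_layers=30, pipeline_model_parallel_size=4)
except ValueError as err:
    print(err)
```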

Environment (please complete the following information):

  • Megatron-LM commit ID: main branch