Description
Hi, thanks for your great work!
I have a question regarding the block-by-block training method described in the paper.
In Listing 8, the training process for the linear attention branch appears to use the output from the preceding softmax attention layer as its input, rather than the output from the previous linear attention layer.
If so, each linear attention layer is effectively trained independently on features from the softmax branch, which means the gradients for each linear attention layer are isolated from those of the other layers. Consequently, computing the loss and backpropagating block by block would be functionally equivalent to summing the losses of all layers and performing a single, joint backward pass at the end.
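To make the equivalence I have in mind concrete, here is a minimal sketch (my own toy setup, not the paper's code): each "linear attention layer" is stood in for by a `torch.nn.Linear`, and each one receives fixed features from the softmax branch rather than the previous student layer's output. Under that assumption, per-layer backward passes and one backward pass on the summed loss produce identical gradients:

```python
import torch

torch.manual_seed(0)
n_layers, batch, dim = 3, 4, 8

# Hypothetical stand-ins: per-layer features from the softmax branch,
# and per-layer regression targets for the linear-attention students.
feats = [torch.randn(batch, dim) for _ in range(n_layers)]
targets = [torch.randn(batch, dim) for _ in range(n_layers)]

# Two identically initialised copies of the student layers.
students_a = [torch.nn.Linear(dim, dim) for _ in range(n_layers)]
students_b = [torch.nn.Linear(dim, dim) for _ in range(n_layers)]
for a, b in zip(students_a, students_b):
    b.load_state_dict(a.state_dict())

# (1) Block-by-block: backpropagate each layer's loss separately.
for layer, x, y in zip(students_a, feats, targets):
    torch.nn.functional.mse_loss(layer(x), y).backward()

# (2) Joint: sum all per-layer losses, then one backward pass.
total = sum(torch.nn.functional.mse_loss(layer(x), y)
            for layer, x, y in zip(students_b, feats, targets))
total.backward()

# Each layer's parameters appear in exactly one loss term, so the two
# schemes yield the same gradients (up to floating-point precision).
max_diff = max((a.weight.grad - b.weight.grad).abs().max().item()
               for a, b in zip(students_a, students_b))
```

Of course, this toy ignores memory: block-by-block training can free each layer's activations immediately, whereas the joint backward must hold all of them at once, so the two may still differ in peak memory even if the gradients match.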
Is my understanding of this training process correct? If so, does the described block-by-block training offer any computational or performance difference compared to standard joint training?