Description
Hi, thanks for your great work!
I have a question regarding the block-by-block training method described in the paper.
In Listing 8, the training process for the linear attention branch appears to use the output from the preceding softmax attention layer as its input, rather than the output from the previous linear attention layer.
If so, each linear attention layer is effectively trained independently on features from the softmax branch, which means the gradients for each linear attention layer are isolated from those of the other layers. Consequently, computing the loss and backpropagating block by block would be functionally equivalent to summing the losses of all layers and performing a single, joint backward pass at the end.
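To make the equivalence I have in mind concrete, here is a minimal sketch (my own toy setup, not the paper's code): each "linear attention layer" is stood in for by a `torch.nn.Linear`, and each one receives fixed features from the softmax branch rather than the previous student layer's output. Under that assumption, per-layer backward passes and one backward pass on the summed loss produce identical gradients:

```python
import torch

torch.manual_seed(0)
n_layers, batch, dim = 3, 4, 8

# Hypothetical stand-ins: per-layer features from the softmax branch,
# and per-layer regression targets for the linear-attention students.
feats = [torch.randn(batch, dim) for _ in range(n_layers)]
targets = [torch.randn(batch, dim) for _ in range(n_layers)]

# Two identically initialised copies of the student layers.
students_a = [torch.nn.Linear(dim, dim) for _ in range(n_layers)]
students_b = [torch.nn.Linear(dim, dim) for _ in range(n_layers)]
for a, b in zip(students_a, students_b):
    b.load_state_dict(a.state_dict())

# (1) Block-by-block: backpropagate each layer's loss separately.
for layer, x, y in zip(students_a, feats, targets):
    torch.nn.functional.mse_loss(layer(x), y).backward()

# (2) Joint: sum all per-layer losses, then one backward pass.
total = sum(torch.nn.functional.mse_loss(layer(x), y)
            for layer, x, y in zip(students_b, feats, targets))
total.backward()

# Each layer's parameters appear in exactly one loss term, so the two
# schemes yield the same gradients (up to floating-point precision).
max_diff = max((a.weight.grad - b.weight.grad).abs().max().item()
               for a, b in zip(students_a, students_b))
```

Of course, this toy ignores memory: block-by-block training can free each layer's activations immediately, whereas the joint backward must hold all of them at once, so the two may still differ in peak memory even if the gradients match.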
Is my understanding of this training process correct? If so, does the described block-by-block training offer any computational or performance difference compared to standard joint training?