What is the purpose of block-by-block training? #15

@t1101675

Description

Hi, thanks for your great work!

I have a question regarding the block-by-block training method described in the paper.

In Listing 8, the training process for the linear attention branch appears to use the output from the preceding softmax attention layer as its input, rather than the output from the previous linear attention layer.

If this is the case, it seems each linear attention layer is trained independently on the features from the softmax branch. This would mean that the gradients for each linear attention layer are isolated from other layers. Consequently, calculating the loss and backpropagating block-by-block would be functionally equivalent to summing the losses from all layers and performing a single, joint backpropagation at the end.
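To illustrate the equivalence being asked about, here is a minimal PyTorch sketch (not the paper's actual code; the block shapes and losses are made up for illustration). Because each hypothetical linear-attention block receives detached features from the softmax branch, per-block backward passes and a single backward over the summed losses produce identical gradients:

```python
import torch

torch.manual_seed(0)

# Stand-ins for detached features from the frozen softmax branch,
# one per block, plus per-block regression targets.
feats = [torch.randn(4, 8) for _ in range(2)]
targets = [torch.randn(4, 8) for _ in range(2)]
blocks = [torch.nn.Linear(8, 8) for _ in range(2)]

# (a) Block-by-block: one loss and one backward per block.
for b, x, t in zip(blocks, feats, targets):
    ((b(x) - t) ** 2).mean().backward()
grads_blockwise = [b.weight.grad.clone() for b in blocks]

# (b) Joint: sum all block losses, then a single backward.
for b in blocks:
    b.zero_grad()
total = sum(((b(x) - t) ** 2).mean()
            for b, x, t in zip(blocks, feats, targets))
total.backward()
grads_joint = [b.weight.grad.clone() for b in blocks]

# Each block's parameters appear in only its own loss term, so the
# gradients are identical in both schemes.
match = all(torch.allclose(ga, gb)
            for ga, gb in zip(grads_blockwise, grads_joint))
print(match)  # True
```

The schemes differ only in memory/scheduling (block-by-block frees each block's activation graph before the next backward), not in the resulting gradients.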

Is my understanding of this training process correct? If so, does the described block-by-block training offer any computational or performance difference compared to standard joint training?
