Open
Description
Arxiv/Blog/Paper Link
https://arxiv.org/abs/2405.07395
Detailed Description

Context
Factorized (axial) attention is quadratic in each axis length rather than in the full grid size, which should greatly reduce the compute the model needs, for both training (the more important case) and inference. Most global weather models already seem able to run inference on small amounts of compute, but training still takes a lot of it.
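A minimal sketch of the cost difference, assuming PyTorch; the function name and the single-head, no-projection setup are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def axial_attention(x: torch.Tensor) -> torch.Tensor:
    """Self-attention along each spatial axis separately (single head,
    no learned projections, purely to show the factorization).

    x: (batch, H, W, dim). Full attention over the H*W tokens costs
    O((H*W)^2) score computations; attending along rows and then
    columns costs O(H*W*(H+W)) instead.
    """
    b, h, w, d = x.shape

    # Attention along the W axis: each of the H rows is an independent
    # sequence of length W.
    rows = x.reshape(b * h, w, d)
    rows = F.scaled_dot_product_attention(rows, rows, rows)
    x = rows.reshape(b, h, w, d)

    # Attention along the H axis: each of the W columns is an independent
    # sequence of length H.
    cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
    cols = F.scaled_dot_product_attention(cols, cols, cols)
    return cols.reshape(b, w, h, d).permute(0, 2, 1, 3)
```

For a 0.25-degree 720 x 1440 lat/lon grid, full attention would need (720 * 1440)^2 ≈ 1.1e12 pairwise scores, while the two axial passes need 720 * 1440 * (720 + 1440) ≈ 2.2e9, roughly 480x fewer.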