| Role       | Name                 |
|------------|----------------------|
| Author     | Egor Petrov          |
| Consultant | Nikita Kiselev       |
| Advisor    | Andrey Grabovoy, PhD |
Training a neural network amounts to searching for a minimum of the loss function, which defines a surface in the space of model parameters. The properties of this surface are determined by the chosen architecture, the loss function, and the training data. Existing studies show that, as the number of objects in the sample increases, the loss surface ceases to change significantly. In this work, we derive a convergence bound for the loss surface of a transformer architecture with attention layers and run computational experiments that confirm the theoretical results. We also propose a theoretical estimate of the minimum sample size required to train a model to any predetermined acceptable error, together with experiments that support the derived bounds.
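The claim that the empirical loss surface stabilizes as the sample size grows can be illustrated numerically. Below is a minimal PyTorch sketch, not the protocol from the paper: it evaluates the empirical loss of a small transformer classifier at a fixed parameter point on nested subsamples of increasing size and reports the consecutive differences, which are expected to shrink as the sample grows. The `TinyTransformerClassifier` model, the synthetic data, and the subsample sizes are hypothetical placeholders for illustration only.

```python
import torch
import torch.nn as nn

class TinyTransformerClassifier(nn.Module):
    """A small transformer encoder with a linear classification head (illustrative)."""
    def __init__(self, d_model=32, nhead=4, n_layers=2, n_classes=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):  # x: (batch, seq, d_model)
        return self.head(self.encoder(x).mean(dim=1))

def landscape_deltas(model, loss_fn, inputs, targets, sizes):
    """|L_{k'}(w) - L_k(w)| for consecutive subsample sizes, at fixed weights w."""
    with torch.no_grad():
        losses = [loss_fn(model(inputs[:k]), targets[:k]).item() for k in sizes]
    return [abs(b - a) for a, b in zip(losses, losses[1:])]

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyTransformerClassifier()
    loss_fn = nn.CrossEntropyLoss()
    # Synthetic sample: 512 objects, sequence length 8, feature dimension 32.
    x = torch.randn(512, 8, 32)
    y = torch.randint(0, 4, (512,))
    # Differences between empirical losses on nested subsamples; they are
    # expected to shrink as the sample size grows, mirroring the convergence
    # of the loss surface discussed above.
    print(landscape_deltas(model, loss_fn, x, y, sizes=[64, 128, 256, 512]))
```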
If you find our work helpful, please cite us.
```bibtex
@article{petrov2025transformerlandscape,
  title={Convergence of the loss function surface in transformer neural network architectures},
  author={Petrov, Egor and Kiselev, Nikita and Meshkov, Vladislav and Grabovoy, Andrey},
  year={2025}
}
```
This project is licensed under the MIT License. See LICENSE for details.