My best attempt at implementing Grok in PyTorch
Grok, or Grok-1, is a recently open-sourced mixture-of-experts language model by xAI.
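To make the mixture-of-experts idea concrete before diving in: a router scores each token, and only the top-k experts contribute to that token's output. The sketch below is a generic top-k MoE layer, not Grok's actual routing; the dimensions, expert count, and top-k value are placeholders.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only, not Grok's routing)."""
    def __init__(self, dim: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        logits = self.router(x)                              # (batch, seq, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # route each token to its top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # combined gate for tokens routed to expert i, zero elsewhere
            gate = (weights * (indices == i)).sum(dim=-1, keepdim=True)
            # note: runs every expert densely for clarity; real MoE layers
            # dispatch only the routed tokens to each expert
            out = out + gate * expert(x)
        return out

# usage: ToyMoE()(torch.randn(2, 16, 64))  ->  (2, 16, 64)
```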
- This implementation is not intended to be run; it is a reference for understanding the architecture of the Grok model, which is also why I wrote it. Personally, I also find it easier to reason about a model's architecture when shapes are provided via type hints (see the sketch after this list).
- There are probably still bugs in this implementation; I have not tested it extensively, and I may have missed aspects of the architecture. It should nevertheless give you an idea of how Grok works.
- The original JAX/Haiku implementation of Grok can be found here.
- Certain parts of the code were adapted from x-transformers (MIT License), mema (MIT License), and huggingface/transformers (Apache 2.0 License).
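As an example of what shape-carrying type hints can look like, here is a minimal module annotated with the jaxtyping library; the hints in this repo may use a different style, and the module itself is made up for illustration.

```python
import torch
from torch import Tensor, nn
from jaxtyping import Float  # pip install jaxtyping

class Projection(nn.Module):
    """Toy linear projection whose signature documents tensor shapes."""
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim_out, dim_in))

    def forward(
        self, x: Float[Tensor, "batch seq dim_in"]
    ) -> Float[Tensor, "batch seq dim_out"]:
        # the annotations make the expected shapes visible at a glance
        return x @ self.weight.T
```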
```bibtex
@misc{su2023roformer,
    title  = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author = {Jianlin Su and Yu Lu and Shengfeng Pan and Ahmed Murtadha and Bo Wen and Yunfeng Liu},
    year   = {2023},
    eprint = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```
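For readers unfamiliar with rotary embeddings, here is the core computation from the paper reduced to a standalone function. The base of 10000 and the pairing of channels follow the paper; the implementation in this repo may organize the computation differently.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (..., seq, dim); dim must be even."""
    *_, seq, dim = x.shape
    # one rotation frequency per pair of channels, geometrically spaced
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]  # split channels into (even, odd) pairs
    # rotate each pair by its position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```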
```bibtex
@misc{vaswani2017attention,
    title  = {Attention Is All You Need},
    author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
    year   = {2017},
    eprint = {1706.03762},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
```
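And the scaled dot-product attention at the heart of that paper, as a minimal causal variant. Grok's actual attention layer differs in its details, so treat this as the textbook computation only.

```python
import math
import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with a causal mask.
    q, k, v: (batch, heads, seq, head_dim)."""
    seq = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (batch, heads, seq, seq)
    # mask out the strictly upper triangle so each position attends only to the past
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```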
```bibtex
@misc{zhang2019root,
    title  = {Root Mean Square Layer Normalization},
    author = {Biao Zhang and Rico Sennrich},
    year   = {2019},
    eprint = {1910.07467},
    archivePrefix = {arXiv},
    primaryClass  = {cs.LG}
}
```
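RMSNorm is small enough to state in full: normalize by the root mean square of the features, apply a learned gain, and skip the mean subtraction and bias of standard LayerNorm. A sketch (the eps value and parameter name are arbitrary choices here):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: x / rms(x) * g, no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.g = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.g
```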