
Architecture #1106

Open
daniel-deychakiwsky opened this issue Apr 21, 2024 · 0 comments
daniel-deychakiwsky commented Apr 21, 2024

Hey Meta.

I noticed that the Llama 1 paper states:

2.2 Architecture
Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main difference with the original architecture, and where we were found the inspiration for this change (in bracket):

However, I don't see a listed "difference" in that paper indicating the model is decoder-only.

I also noticed that the Llama 2 paper states:

2.2 Training Details
We adopt most of the pretraining setting and model architecture from Llama 1. We use the standard transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings (RoPE, Su et al. 2022).

These passages led me to believe Llama 1 and Llama 2 are encoder-decoder models based on the original 2017 transformer architecture. The code in this repo, however, reads as if the model is decoder-only, which is stated explicitly for the new Llama 3. Can you confirm what the Llama 1 and Llama 2 architectures are, and potentially document that in this repo?
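For context on what I mean by "decoder-only": here is a minimal, illustrative sketch (not the repo's actual code, which uses RMSNorm, RoPE, SwiGLU, etc.) of a decoder-only block. The key point is that it only has causal self-attention over the input sequence, with no encoder stack and no cross-attention, unlike the full encoder-decoder transformer in Vaswani et al., 2017.

```python
import torch
import torch.nn as nn

class DecoderOnlyBlock(nn.Module):
    """Illustrative only: a decoder-style block with causal self-attention
    and no cross-attention (hence no encoder). Hypothetical names; the
    repo's model.py differs in normalization, activations, and attention."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim)
        )
        # Pre-normalization; Llama actually uses RMSNorm rather than LayerNorm.
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each position attends only to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
```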
