Why do we use '<|endoftext|>' at the beginning of the sentence? #51
-
Hi, has anyone noticed that we put '<|endoftext|>' at the beginning of each sentence when preparing the fineweb data? Specifically line 35 of
I have 2 questions about this:
Thank you so much if anyone can help.
-
endoftext is a bit of a misnomer, it is a document delimiting token. it's especially useful if you want to start sampling a new document "from scratch", where you'd pass endoftext into the model at the very first time step.

As for question 2 you're right, it could very well be cleaner to do "<|endoftext|>Hello, I'm a language model,". This would give the LLM additional information that this is a new document.

Do note that during training we sample random windows of the text and train on that, so the model is perfectly "used to" seeing text with no context and it just assumes it's probably somewhere in the middle of a larger document and it does its best. That's basically what ends up happening without passing it in. The model will sample something that is consistent with the prefix, but statistically will match it up to scenarios where it is possibly in the middle of a long document somewhere. TLDR both work.