Why do we use '<|endoftext|>' at the beginning of the sentence? #51
-
Hi, has anyone noticed that we put '<|endoftext|>' at the beginning of each sentence when preparing the fineweb data? Specifically line 35 of
I have 2 questions about this:
Thank you so much if anyone can help.
-
endoftext is a bit of a misnomer, it is a document delimiting token. it's especially useful if you want to start sampling a new document "from scratch", where you'd pass endoftext into the model at the very first time step.

As for question 2 you're right, it could very well be cleaner to do "<|endoftext|>Hello, I'm a language model,". This would give the LLM additional information that this is a new document.

Do note that during training we sample random windows of the text and train on that, so the model is perfectly "used to" seeing text with no context and it just assumes it's probably somewhere in the middle of a larger document and it does its best. That's basically what ends up happening without passing it in. The model will sample something that is consistent with the prefix, but statistically will match it up to scenarios where it is possibly in the middle of a long document somewhere. TLDR both work.