How to improve data sampling? #21
Unanswered
LukasKinder asked this question in Q&A
Replies: 0 comments
Awesome video, thank you very much!
Could you explain how data sampling might be improved?
If I understand correctly, the documents of the FineWeb-EDU subset are concatenated to form one large token stream. This means multiple documents can appear within a single context window during training. Doesn't this risk the LLM learning incorrect relationships?
For example, if a document about nuclear energy in France is followed by one about Mexican America, the LLM may learn to generate tokens about Mexican America after seeing text about nuclear energy in France. Ideally, the model should learn to ignore tokens that come before an <|endoftext|>, because they belong to a different context.
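To make the concern concrete, here is a minimal sketch of this kind of sequence packing. The function name and the use of token strings instead of ids are my own illustration, not the repo's actual data-loading code; the point is only that fixed-size windows can straddle document boundaries.

```python
EOT = "<|endoftext|>"  # separator token between documents (id 50256 in the GPT-2 tokenizer)

def pack_documents(tokenized_docs, context_len):
    """Concatenate tokenized documents (lists of token strings here, for
    readability) into one stream separated by <|endoftext|>, then slice
    the stream into fixed-size training windows.

    Note: a window may contain pieces of several documents, which is
    exactly the situation the question is about.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOT)
    # Non-overlapping fixed-length windows over the packed stream.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]
```

With two short documents and a window of 3, each window here happens to hold one document plus its separator, but with realistic document lengths the windows would regularly cut across boundaries.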
Can we mitigate this by simply shuffling the documents every epoch? Or could we even do something fancier, like masking the attention so that tokens cannot attend to earlier tokens that are separated from them by an <|endoftext|>?
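The masking idea in the question can be sketched as follows. This is a hypothetical illustration (names like `document_causal_mask` and the token id are my assumptions, following the GPT-2 convention), not code from the repository: it combines the usual causal mask with a "same document" mask derived from <|endoftext|> positions, so packed documents cannot attend to each other.

```python
import numpy as np

EOT_ID = 50256  # assumed <|endoftext|> token id (GPT-2 tokenizer convention)

def document_causal_mask(token_ids):
    """Boolean (n, n) mask: position i may attend to position j only if
    j <= i (causal) AND both positions lie in the same packed document.

    A document's id is the number of <|endoftext|> tokens strictly before
    it, so each <|endoftext|> still belongs to the document it terminates.
    """
    token_ids = np.asarray(token_ids)
    n = len(token_ids)
    eot_before = np.cumsum(token_ids == EOT_ID)
    doc_id = np.concatenate([[0], eot_before[:-1]])  # shift: count EOTs strictly before i
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc
```

In a transformer implementation this mask would replace the plain lower-triangular mask before the softmax (disallowed positions set to -inf). This technique is sometimes called intra-document or document-boundary attention masking.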