How to improve data sampling? #21
Unanswered
LukasKinder asked this question in Q&A
Replies: 0 comments
Awesome video, thank you very much!
Could you explain how data sampling might be improved?
If I understand correctly, the documents of the FineWeb-EDU subset are concatenated to form one large token stream. This means multiple documents can appear within a single context window during training. Doesn't this risk the LLM learning incorrect relationships?
For example, if a document about nuclear energy in France is followed by one about Mexican America, the LLM may learn to generate tokens about Mexican America after seeing text about nuclear energy in France. Ideally, the model should learn to ignore tokens that come before an <|endoftext|>, because they belong to a different context.
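To make the concern concrete, here is a minimal sketch of this kind of sequence packing. The function name and the use of token strings instead of ids are my own illustration, not the repo's actual data-loading code; the point is only that fixed-size windows can straddle document boundaries.

```python
EOT = "<|endoftext|>"  # separator token between documents (id 50256 in the GPT-2 tokenizer)

def pack_documents(tokenized_docs, context_len):
    """Concatenate tokenized documents (lists of token strings here, for
    readability) into one stream separated by <|endoftext|>, then slice
    the stream into fixed-size training windows.

    Note: a window may contain pieces of several documents, which is
    exactly the situation the question is about.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(EOT)
    # Non-overlapping fixed-length windows over the packed stream.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]
```

With two short documents and a window of 3, each window here happens to hold one document plus its separator, but with realistic document lengths the windows would regularly cut across boundaries.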
Can we mitigate this by simply shuffling the documents every epoch? Or could we even do something fancier, like masking the attention so that tokens cannot attend to earlier tokens that are separated from them by an <|endoftext|>?
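The masking idea in the question can be sketched as follows. This is a hypothetical illustration (names like `document_causal_mask` and the token id are my assumptions, following the GPT-2 convention), not code from the repository: it combines the usual causal mask with a "same document" mask derived from <|endoftext|> positions, so packed documents cannot attend to each other.

```python
import numpy as np

EOT_ID = 50256  # assumed <|endoftext|> token id (GPT-2 tokenizer convention)

def document_causal_mask(token_ids):
    """Boolean (n, n) mask: position i may attend to position j only if
    j <= i (causal) AND both positions lie in the same packed document.

    A document's id is the number of <|endoftext|> tokens strictly before
    it, so each <|endoftext|> still belongs to the document it terminates.
    """
    token_ids = np.asarray(token_ids)
    n = len(token_ids)
    eot_before = np.cumsum(token_ids == EOT_ID)
    doc_id = np.concatenate([[0], eot_before[:-1]])  # shift: count EOTs strictly before i
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc
```

In a transformer implementation this mask would replace the plain lower-triangular mask before the softmax (disallowed positions set to -inf). This technique is sometimes called intra-document or document-boundary attention masking.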