
Paper Section 2 (Pretraining), §2.2: Why are datasets of different sizes all trained for up to 150B tokens? #24

Open
yucc-leon opened this issue May 28, 2024 · 0 comments

Comments

@yucc-leon

[screenshot from the paper]

The Math model is trained on a dataset of 120B tokens, while the datasets it is compared against are:

- 8.9B tokens
- 13.6B tokens
- 13.6B × 4 + 10.3B × 1 + 28.0B × 2 ≈ 120B tokens

(Please correct me if any of these figures are wrong.)

This means the smallest dataset would have to be trained for close to 20 epochs, which makes overfitting, and hence degraded performance, fairly likely. Wouldn't a fairer comparison generally use a smaller token budget, for example the size of the smallest dataset or less, and downsample any dataset that exceeds that threshold?

I would like to ask what considerations this experimental setting was based on.
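For reference, here is a minimal back-of-the-envelope sketch of the epoch counts implied by the numbers quoted above. The dataset sizes are the figures from this issue, the 120B budget is from the issue text, and the 150B budget is from the issue title; the dataset labels are placeholders, not names from the paper.

```python
# Rough epoch estimate: training-token budget divided by dataset size.
# Dataset sizes (billions of tokens) are the figures quoted in this issue;
# the labels ("dataset A", etc.) are hypothetical placeholders.
dataset_sizes_b = {
    "dataset A": 8.9,
    "dataset B": 13.6,
    "mixture (13.6*4 + 10.3 + 28.0*2)": 13.6 * 4 + 10.3 * 1 + 28.0 * 2,
}

# 120B is the budget mentioned in the issue text, 150B the one in the title.
for budget_b in (120.0, 150.0):
    print(f"Training budget: {budget_b:.0f}B tokens")
    for name, size_b in dataset_sizes_b.items():
        epochs = budget_b / size_b
        print(f"  {name:35s} {size_b:6.1f}B tokens -> ~{epochs:.1f} epochs")
```

Under these assumptions the smallest (8.9B-token) dataset would be seen roughly 13 to 17 times depending on which budget applies, which is the repetition the question is concerned about.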
