
LanguageCrossEntropy logs nan when bash pruning.sh #34

Open

YanxiZSQ opened this issue Dec 6, 2023 · 6 comments

YanxiZSQ commented Dec 6, 2023

I have an issue. I used the two datasets you provided, [book,github].
The mds_sample_redpajama directory looks like this:
[screenshot of the mds_sample_redpajama directory]
I modified pruning.sh accordingly:
[screenshot of the modified pruning.sh]
Then I trained the model, and this still happens:
[screenshot of the training log showing NaN LanguageCrossEntropy values]

xiamengzhou (Contributor) commented Dec 19, 2023

It's not clear to me why this happens. Have you tried the original setup with all 7 domains? Does that cause problems too? Meanwhile, I will try out the 2-domain setup once I get some compute ready.

PengWenChen commented Dec 21, 2023

Hi @xiamengzhou,
I also encountered this issue with the original dynamic loading setup in pruning.sh:
set_names=[cc,github,book,stackexchange,wiki,arxiv,c4]
proportion=[0.67,0.045,0.045,0.02,0.045,0.025,0.15]

NaN appears in the first batch when computing metric/train/stackexchange_LanguageCrossEntropy.

My environment is the same as yours, except that flash-attn is 2.3.6.
The sample data for pruning is 0.1B.
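For context, this NaN is consistent with how a per-domain cross entropy behaves when a batch contains zero tokens for that domain: the metric divides a token-loss sum by a token count, and 0/0 is NaN. A minimal PyTorch sketch (the tensors and the domain-id mask here are illustrative, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

# Toy batch: 6 tokens over a vocab of 10, each token tagged with a domain id.
logits = torch.randn(6, 10)
targets = torch.randint(0, 10, (6,))
domain_ids = torch.tensor([0, 0, 0, 1, 1, 1])  # no tokens from domain 2

per_token_loss = F.cross_entropy(logits, targets, reduction="none")

for domain in range(3):
    mask = domain_ids == domain
    # Sum of losses over the domain's tokens divided by the token count;
    # for domain 2 this is 0 / 0, which evaluates to NaN.
    domain_ce = per_token_loss[mask].sum() / mask.sum()
    print(f"domain {domain}: {domain_ce.item():.4f}")
```

Domain 2 gets no tokens in this toy batch, so its cross entropy prints as nan, matching the stackexchange metric above.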

xiamengzhou (Contributor) commented:
Could you try the processed data I have here: https://drive.google.com/drive/folders/1WPIRx2NGkNBDswqZZh-hwI1h-QiKVCuN and see if the same issue occurs?
@PengWenChen @YanxiZSQ

PengWenChen commented:
Hi @xiamengzhou! Thanks for your reply.
However, I cannot access Google Drive from where I am working :(
Could you please upload the processed data to this repository? It would really help a lot!

PengWenChen commented Jan 4, 2024

Hi, @xiamengzhou!
The proportion update fails because of a NaN loss on the evaluation data, and that NaN is caused by missing data for some of the sub-datasets. I solved this by increasing the number of evaluation sequences to 3500!

However, during normal training (updating L_prune), the NaN still happens for the same reason (missing data for some sub-datasets), although L_prune can still be updated.
I would like to confirm the correctness of this part: is it normal to get NaN in train/metric/xx_LanguageCrossEntropy?
Thank you.
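A rough way to see why raising the evaluation size helps: if sequences are sampled i.i.d. and a domain has proportion p, it appears at least once among n evaluation sequences with probability 1 - (1 - p)^n. A back-of-the-envelope sketch (the i.i.d. assumption is a simplification of whatever the dynamic loader actually does; 0.02 is the stackexchange proportion from the config above):

```python
# Probability that a domain with sampling proportion p appears at least once
# among n evaluation sequences, assuming i.i.d. sampling (a simplification;
# the actual dynamic loader may not sample exactly this way).
def coverage(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (100, 500, 3500):
    print(f"n={n}: {coverage(0.02, n):.6f}")
# n=100:  0.867380
# n=500:  0.999959
# n=3500: 1.000000
```

With 3500 evaluation sequences, every domain in the proportion list above is essentially guaranteed to be represented, which is consistent with the fix reported here.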

xiamengzhou (Contributor) commented:
Hi! It's normal to get NaN for some batches when the sampled batch contains no data for a specific domain, usually because the sampling ratio for that domain is low.
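To put a rough number on "usually": under the same i.i.d.-sampling assumption as above, a batch of B sequences contains no sequence from a domain with proportion p with probability (1 - p)^B, so NaN entries in metric/train/xx_LanguageCrossEntropy are expected on a sizable fraction of batches for low-ratio domains. A sketch with hypothetical batch sizes (not values taken from pruning.sh):

```python
# P(a batch of batch_size sequences contains no sequence from a domain with
# sampling ratio p), assuming i.i.d. sampling per sequence. The batch sizes
# below are hypothetical, not values from pruning.sh.
def p_domain_missing(p: float, batch_size: int) -> float:
    return (1 - p) ** batch_size

print(p_domain_missing(0.02, 32))   # ~0.52: stackexchange absent from about half of 32-sequence batches
print(p_domain_missing(0.02, 256))  # ~0.0057: much rarer with larger batches
print(p_domain_missing(0.15, 32))   # ~0.0055: c4 is almost always present
```

If you aggregate these per-batch metrics offline, skipping the NaN entries (e.g., with numpy.nanmean) gives the domain's loss averaged over just the batches where it was actually present.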
