Sharding the dataset not completing? #25
Maybe your disk is full.
@bombless, I thought this too originally, so I moved it and added a local cache folder on that disk, with the same results. I still have plenty of room. I also tried it with the 100B dataset and it did the same thing on the last shard, but at a different percentage. I have started training on this dataset at home and will see if there are any negative results; perhaps I will just delete that shard in case it got cut off mid-sentence or something.
I don't believe this is an issue. The dataset is ~10B tokens, not exactly 10B, so if you look in fine_web.py you'll see that once the last document has been tokenized, the final shard is simply written to file even though it is not filled, since we still want that last portion of data. The progress bar simply isn't calibrated: it is written expecting 100,000,000 tokens in each shard, even though the last shard doesn't contain that much data. I can look into editing the progress bar so it is a little prettier. TL;DR: you are still properly tokenizing and using all the data from the dataset, even though the last shard doesn't say it's filled.
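For illustration, here is a minimal sketch of that shard-writing logic (this is not the actual fine_web.py code; the function name, shard filenames, and token stream are made up for the example). The key point is the final `if count > 0` branch: the leftover tokens are saved as a partial shard while the bar's `total` is still the full shard size, which is exactly the "stuck short of 100,000,000" behavior in the logs above.

```python
import numpy as np
from tqdm import tqdm

SHARD_SIZE = 100_000_000  # tokens per shard, matching the progress bars above

def write_shards(token_stream, shard_size=SHARD_SIZE):
    buf = np.empty(shard_size, dtype=np.uint16)
    count, shard_idx = 0, 0
    bar = tqdm(total=shard_size, unit="tokens", desc=f"Shard {shard_idx}")
    for doc in token_stream:              # one np.uint16 array per document
        while len(doc) > 0:
            take = min(len(doc), shard_size - count)
            buf[count:count + take] = doc[:take]
            count += take
            bar.update(take)
            doc = doc[take:]              # carry any remainder into the next shard
            if count == shard_size:       # shard full: flush and start a new one
                bar.close()
                np.save(f"shard_{shard_idx:03d}.npy", buf)
                shard_idx += 1
                count = 0
                bar = tqdm(total=shard_size, unit="tokens", desc=f"Shard {shard_idx}")
    if count > 0:
        # The dataset is only ~10B tokens, so the last shard ends here, partially
        # filled. The bar's total is still shard_size, hence it never reaches 100%.
        np.save(f"shard_{shard_idx:03d}.npy", buf[:count])
    bar.close()
```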
+1 for @alexanderbowler
I do not think you can download one shard at a time; you will need to limit the number of rows to download instead. Here is some good documentation for you: https://huggingface.co/docs/datasets/en/nlp_load
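As a sketch, two ways to limit rows with the `datasets` library (the dataset name, config, row counts, and `text` column here are assumptions for illustration, taken from how this repo loads FineWeb-Edu; adjust to your case):

```python
from datasets import load_dataset

# Option 1: split slicing materializes only the first N rows
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train[:100000]")

# Option 2: streaming avoids downloading the full dataset up front;
# .take(n) yields only the first n rows
ds_stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                         split="train", streaming=True)
for row in ds_stream.take(5):
    print(row["text"][:80])
```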
Below is what I get every time I try to shard the dataset; it does not look like the last one is completing. I ran this multiple times and each time it stops in the same spot. Any ideas?
Shard 97: 100%|█████████████████████████████████████████████████▉| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|█████████████████████████████████████████████████▉| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99: 54%|██████████████████████████▉ | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>