Sharding the dataset not completing? #25

Open
dustinwloring1988 opened this issue Jun 14, 2024 · 7 comments
@dustinwloring1988

Below is what I get every time I try to shard the dataset; it does not look like the last shard is completing. I ran this multiple times and each time it stops in the same spot. Any ideas?

Shard 97: 100%|█████████████████████████████████████████████████▉| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|█████████████████████████████████████████████████▉| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99: 54%|██████████████████████████▉ | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>

@bombless

Maybe your disk is full

@dustinwloring1988
Author

@bombless, I thought that originally too, so I moved it and added a local cache folder on that disk, with the same results. I still have plenty of room. I also tried it with the 100B dataset and it did the same thing on the last shard, but at a different percentage.

I have started using this dataset for training at home and will see if there are any negative results; perhaps I will just delete that shard in case it cut off mid-sentence or something.

@alexanderbowler

I don't believe this is an issue, since the dataset is ~10B tokens, not exactly 10B. If you look in fine_web.py, you'll see that once the last document has been tokenized, the final shard is simply written to file even though it isn't full, because we still want that last portion of the data. The progress bar just isn't calibrated for this: it is written expecting 100,000,000 tokens in every shard, even though the last shard doesn't contain that much data. I can look into editing the progress bar so it is a little prettier. TL;DR: you are still properly tokenizing and using all the data from the dataset, even though the last shard doesn't show as filled.
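For example, the bar for the final shard could be closed out at its real size. This is only a sketch with made-up numbers (the actual fine_web.py variable names may differ); it just shows the tqdm calls involved:

```python
from tqdm import tqdm

shard_size = 100_000_000   # tokens a full shard is expected to hold
token_count = 53_989_101   # tokens collected for the last shard (illustrative number)

# The bar is created assuming a full shard, so it stalls at ~54% like above.
progress_bar = tqdm(total=shard_size, unit="tokens", desc="Shard 99")
progress_bar.update(token_count)

# Shrink the total to what the partial shard really holds so the bar
# renders 100% before the shard is written out and the script exits.
progress_bar.total = token_count
progress_bar.refresh()
progress_bar.close()
```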

@lukasugar

+1 for @alexanderbowler
Yep, the last shard has fewer than 100M tokens; I've gotten the same numbers:
(screenshot of the same truncated final-shard progress output)

@zzs97str

Due to limited disk space and compute, I just want to get 10 shards to train on instead of using the whole dataset. To do this, I stopped the code after three "downloading data 100%" messages. Looking through the cache, I found that all the filenames are strange strings of numbers and letters, instead of something like "shard_000000, shard_000001". What can I do? Thanks for any suggestions!
(screenshot of the cache directory listing)

@dustinwloring1988
Author

I do not think it downloads one shard at a time; you will need to limit the number of rows you download. Here is some good documentation for you: https://huggingface.co/docs/datasets/en/nlp_load
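For example, something like the following should avoid pulling the whole dataset into the cache. This is a rough sketch; the dataset name, config, and row count are assumptions, so adjust them to whatever your script actually loads:

```python
from datasets import load_dataset

# Stream the dataset instead of downloading every file to the local cache,
# then take only as many rows as you need for a handful of shards.
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for row in fw.take(1_000_000):   # row count is a guess; tune it to your target shard count
    text = row["text"]           # tokenize and append to your shard buffer as usual
```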

@dustinwloring1988
Author

@zzs97str
