Sharding the dataset not completing? #25

Open
dustinwloring1988 opened this issue Jun 14, 2024 · 7 comments
@dustinwloring1988

Below is what I get every time I try to shard the dataset; it does not look like the last shard is completing. I ran this multiple times and each time it stops in the same spot. Any ideas?

Shard 97: 100%|█████████████████████████████████████████████████▉| 99999910/100000000 [00:10<00:00, 9236426.65tokens/s]
Shard 98: 100%|█████████████████████████████████████████████████▉| 99999499/100000000 [00:11<00:00, 8723382.11tokens/s]
Shard 99: 54%|██████████████████████████▉ | 53989101/100000000 [00:08<00:07, 6051927.02tokens/s]
PS E:\build-nanogpt-master\build-nanogpt-master>

@bombless

Maybe your disk is full

@dustinwloring1988
Author

@bombless, I thought that originally too, so I moved it and added a local cache folder on that disk, with the same results. I still have plenty of room. I also tried it with the 100B dataset and it did the same thing on the last shard, but at a different percentage.

I have started using this dataset for training at home and will see if there are any negative results; perhaps I will just delete that shard in case it cut off mid-sentence or something.

@alexanderbowler

I don't believe this is an issue, since the dataset is ~10B tokens, not exactly 10B. If you look in fine_web.py, you'll see that once the last document has been tokenized, the final shard is simply written to file even though it isn't full, because we still want that last portion of the data. The progress bar just isn't calibrated for this: it is written expecting 100,000,000 tokens in every shard, even though the last shard doesn't contain that much data. I can look into editing the progress bar so it is a little prettier. TL;DR: you are still properly tokenizing and using all the data from the dataset, even though the last shard doesn't show as filled.
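For example, the bar for the final shard could be closed out at its real size. This is only a sketch with made-up numbers (the actual fine_web.py variable names may differ); it just shows the tqdm calls involved:

```python
from tqdm import tqdm

shard_size = 100_000_000   # tokens a full shard is expected to hold
token_count = 53_989_101   # tokens collected for the last shard (illustrative number)

# The bar is created assuming a full shard, so it stalls at ~54% like above.
progress_bar = tqdm(total=shard_size, unit="tokens", desc="Shard 99")
progress_bar.update(token_count)

# Shrink the total to what the partial shard really holds so the bar
# renders 100% before the shard is written out and the script exits.
progress_bar.total = token_count
progress_bar.refresh()
progress_bar.close()
```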

@lukasugar

+1 for @alexanderbowler
Yep, the last shard has fewer than 100M tokens; I've gotten the same numbers:
(screenshot of the same truncated final-shard progress output)

@zzs97str

Due to limited disk space and compute, I just want to get 10 shards to train on instead of using the whole dataset. To do this, I stopped the code after three "downloading data 100%" messages. Looking through the cache, I found that all the filenames are strange strings of numbers and letters, instead of something like "shard_000000, shard_000001". What can I do? Thanks for any suggestions!
(screenshot of the cache directory listing)

@dustinwloring1988
Author

I do not think it downloads one shard at a time; you will need to limit the number of rows you download. Here is some good documentation for you: https://huggingface.co/docs/datasets/en/nlp_load
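For example, something like the following should avoid pulling the whole dataset into the cache. This is a rough sketch; the dataset name, config, and row count are assumptions, so adjust them to whatever your script actually loads:

```python
from datasets import load_dataset

# Stream the dataset instead of downloading every file to the local cache,
# then take only as many rows as you need for a handful of shards.
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for row in fw.take(1_000_000):   # row count is a guess; tune it to your target shard count
    text = row["text"]           # tokenize and append to your shard buffer as usual
```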

@dustinwloring1988
Author

@zzs97str
