I noticed that the first step of the pre-processing involves normalizing the text dataset and writing everything out to a single large .txt file. Is it necessary to produce this concatenated file before training the tokenizer? My dataset is ~300 GB, so the concatenation step alone would take a very long time (unless there is some way to concatenate files in parallel?).

Thoughts?
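To illustrate what I have in mind, here is a rough sketch of training directly from an iterator over the shard files instead of one concatenated file. This assumes a Hugging Face `tokenizers`-style BPE trainer and a hypothetical `data/shards` directory of plain-text shards; it may not match what this repo actually does under the hood.

```python
from pathlib import Path

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical shard location; replace with wherever the normalized shards live.
shard_paths = sorted(Path("data/shards").glob("*.txt"))

def line_iterator():
    """Stream lines from every shard so no concatenated file is ever written."""
    for path in shard_paths:
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                yield line

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])

# train_from_iterator consumes the stream directly, so the ~300 GB corpus
# never has to be merged into a single .txt file.
tokenizer.train_from_iterator(line_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```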