I noticed that the first step of the pre-processing involves normalizing the text dataset and writing everything out to a single large .txt file. Is it necessary to produce this concatenated file before training the tokenizer? My dataset is ~300 GB, so the concatenation step alone would take a very long time (unless there is some way to concatenate files in parallel?).

Thoughts?
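To illustrate what I have in mind, here is a rough sketch of training directly from an iterator over the shard files instead of one concatenated file. This assumes a Hugging Face `tokenizers`-style BPE trainer and a hypothetical `data/shards` directory of plain-text shards; it may not match what this repo actually does under the hood.

```python
from pathlib import Path

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical shard location; replace with wherever the normalized shards live.
shard_paths = sorted(Path("data/shards").glob("*.txt"))

def line_iterator():
    """Stream lines from every shard so no concatenated file is ever written."""
    for path in shard_paths:
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                yield line

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])

# train_from_iterator consumes the stream directly, so the ~300 GB corpus
# never has to be merged into a single .txt file.
tokenizer.train_from_iterator(line_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```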