Could you provide tokenized continue-pretraining dataset for reproduction? #51

gywlssww · 2024-01-23T07:09:06Z

Could you provide the tokenized continue-pretraining dataset for reproduction, as you did for the pruning dataset?
Is the tokenizer.model you provided exactly the same tokenizer as Llama-2's?

xiamengzhou · 2024-01-23T14:32:52Z

Yes, we use the same tokenizer as Llama-2. We'd love to share the data, but due to the sheer amount of it, I am not sure what the best way to serve it is. Let me know if you have any ideas!
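
For anyone who wants to confirm the tokenizer match locally, a minimal sketch is to compare checksums of the two files. This assumes you already have both `tokenizer.model` files on disk; the helper names and the paths in the usage comment are hypothetical and not part of this repo. Matching hashes mean the files are byte-identical. Differing hashes only mean the bytes differ, which does not by itself prove the vocabularies differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a: str, path_b: str) -> bool:
    """True if both files are byte-identical (compared via SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (hypothetical paths, adjust to your local copies):
#   same_file_contents("repo/tokenizer.model", "llama-2/tokenizer.model")
```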

gywlssww · 2024-01-25T12:55:58Z

Does the size of the dataset exceed the limits of Google Drive, OneDrive, or Dropbox?

vmasrani · 2024-07-23T18:54:57Z

+1! It would be very helpful to have the finetuning/continue-pretraining dataset as well, to be able to reproduce the paper's results.
