Is there explicitly instruction-following data in the version of Dolma used to train OLMo v1? #177

@john-hewitt

Description

Hi everyone,

I'm working on a research project related to instruction following, and it would be amazing to have a language model with a guarantee that no explicitly instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.

Some thoughts:

  • I don't have the disk space to build Dolma. Alas!
  • I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
  • I say "explicitly" instruction-following data because lots of naturally occurring web data (Stack Overflow, etc.) has instruction-response-like formats; that's fine. I'm just worried about the increasingly common practice of mixing explicit "instruction-following SFT data" into the pretraining mix.
  • I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style <|assistant|> n-gram shows up around 20M times. But it returns an identical count for "assistant" without the special-token formatting, so I suspect the viewer strips the formatting before indexing, which makes this check uninformative. (A rough do-it-yourself check is sketched after this list.)
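
For concreteness, here's roughly the kind of check I have in mind: stream a slice of Dolma from the Hugging Face hub and count documents containing verbatim chat-template markers, without building the dataset locally. This is only a sketch; the config name "v1_6", the `text` field, and the marker list are my assumptions about the `allenai/dolma` dataset, not verified against its actual schema.

```python
# Sketch: stream a slice of Dolma and count docs containing chat-template markers.
# Assumptions: config "v1_6" exists, documents carry a "text" field, and the hub
# dataset supports streaming (older/newer `datasets` versions may also need
# trust_remote_code=True for script-based datasets).
from datasets import load_dataset

MARKERS = ["<|assistant|>", "<|user|>", "### Instruction:", "### Response:"]

ds = load_dataset("allenai/dolma", name="v1_6", split="train", streaming=True)

counts = {m: 0 for m in MARKERS}
for i, doc in enumerate(ds):
    text = doc.get("text", "")
    for m in MARKERS:
        if m in text:
            counts[m] += 1
    if i >= 100_000:  # look at a small slice only; this is not a full scan
        break

print(counts)
```

Of course this only catches verbatim template strings; it wouldn't catch instruction data that was reformatted before mixing, which is why a "not intentionally" answer from the Dolma maintainers would still be the most useful thing here.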

I realize data can leak in, so the answer is probably not "definitely not," but does anyone know if the answer is at least "not intentionally"?

See the corresponding OLMo issue; I wasn't sure how much information sharing there would be between the two repos: allenai/OLMo#658

Thanks!
