Instruction-tuning models on super-long text #1046
Unanswered
GianlucaDeStefano asked this question in Q&A
Hi everyone,
I’m working with a custom dataset consisting of extremely long texts—each approximately 100K tokens or more—which cannot be split into smaller subtexts. Each text is paired with a question and answer, requiring the model to reason through the entire context to generate an appropriate response.
The challenge I'm facing is the amount of GPU memory required to fine-tune any model with such lengthy contexts. With access limited to A100 GPUs (80 GB each), naive parallelism techniques are insufficient: a single GPU cannot hold the activations for a 100K-token sequence.
To address this, I'm considering fine-tuning my models with tensor parallelism in Megatron-LM. My goal is to fine-tune Llama 3.1, whose context window was extended from 8K to 128K tokens via RoPE scaling.
I’m seeking advice on how to achieve the equivalent of Hugging Face TRL's DataCollatorForCompletionOnlyLM in Megatron-LM. Specifically, I want the loss to be computed only for the tokens that occur after a designated 'marker' in each text, i.e., only on the 'answer' the model is asked to generate.
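For reference, this is a minimal sketch of what I currently do on the Hugging Face side with TRL's collator (the model name and the "### Answer:" marker string are placeholders for my actual setup, and depending on the tokenizer the response template may need to be passed as token ids rather than a string):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Labels for everything up to and including the "### Answer:" marker are set
# to -100, so cross-entropy is computed on the answer tokens only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:",
    tokenizer=tokenizer,
)

example = "<~100K-token document>\n\n### Question: ...\n### Answer: ..."
batch = collator([tokenizer(example)])
# batch["labels"] is -100 everywhere except for the tokens after the marker.
```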
How can I implement this functionality in Megatron-LM? And is Megatron-LM the right tool for my particular use case?
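For context on what I mean by "equivalent functionality": my working assumption is that on the Megatron-LM side this amounts to zeroing out the per-token loss mask for everything up to and including the marker, since the loss is then just a loss-mask-weighted average over tokens. Below is a framework-agnostic sketch of that idea (the function and the marker handling are my own, not Megatron-LM APIs):

```python
import torch


def completion_only_loss_mask(input_ids: torch.Tensor, marker_ids: list[int]) -> torch.Tensor:
    """Return a [seq_len] float mask that is 1.0 only for tokens after the marker.

    input_ids: 1-D tensor of token ids for one sample
               (document + question + marker + answer).
    marker_ids: token ids of the marker sequence (e.g. the tokenized "### Answer:").
    """
    seq_len = input_ids.size(0)
    loss_mask = torch.zeros(seq_len, dtype=torch.float)

    m = len(marker_ids)
    marker = torch.tensor(marker_ids, dtype=input_ids.dtype)
    # Find the last occurrence of the marker sequence; if it is not found,
    # the mask stays all zeros and the sample contributes no loss.
    for start in range(seq_len - m, -1, -1):
        if torch.equal(input_ids[start:start + m], marker):
            loss_mask[start + m:] = 1.0  # answer tokens only
            break
    return loss_mask


# Toy example: marker is [99, 100], answer is the two tokens after it.
ids = torch.tensor([5, 6, 7, 99, 100, 8, 9])
mask = completion_only_loss_mask(ids, [99, 100])
# mask -> tensor([0., 0., 0., 0., 0., 1., 1.])

# The masked loss would then be the usual loss-mask-weighted average:
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```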