Instruction-tuning models on super-long text #1046
Unanswered
GianlucaDeStefano asked this question in Q&A
Hi everyone,
I’m working with a custom dataset consisting of extremely long texts—each approximately 100K tokens or more—which cannot be split into smaller subtexts. Each text is paired with a question and answer, requiring the model to reason through the entire context to generate an appropriate response.
The challenge I'm facing is the amount of GPU memory required to fine-tune any model with such lengthy contexts. With access limited to A100 GPUs (80 GB each), naive parallelism techniques are insufficient: a single GPU cannot hold the activations for a 100K-token sequence.
To address this, I'm considering fine-tuning my models with tensor parallelism in Megatron-LM. My goal is to fine-tune Llama 3.1, whose context window was extended from 8K to 128K tokens via RoPE scaling.
I’m seeking advice on how to achieve the equivalent of Hugging Face TRL's DataCollatorForCompletionOnlyLM in Megatron-LM. Specifically, I want the loss to be computed only for the tokens that occur after a designated 'marker' in each text, i.e., only on the 'answer' the model is asked to generate.
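For reference, this is a minimal sketch of what I currently do on the Hugging Face side with TRL's collator (the model name and the "### Answer:" marker string are placeholders for my actual setup, and depending on the tokenizer the response template may need to be passed as token ids rather than a string):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Labels for everything up to and including the "### Answer:" marker are set
# to -100, so cross-entropy is computed on the answer tokens only.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Answer:",
    tokenizer=tokenizer,
)

example = "<~100K-token document>\n\n### Question: ...\n### Answer: ..."
batch = collator([tokenizer(example)])
# batch["labels"] is -100 everywhere except for the tokens after the marker.
```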
How can I implement this functionality in Megatron-LM? And is Megatron-LM the right tool for my particular use case?
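For context on what I mean by "equivalent functionality": my working assumption is that on the Megatron-LM side this amounts to zeroing out the per-token loss mask for everything up to and including the marker, since the loss is then just a loss-mask-weighted average over tokens. Below is a framework-agnostic sketch of that idea (the function and the marker handling are my own, not Megatron-LM APIs):

```python
import torch


def completion_only_loss_mask(input_ids: torch.Tensor, marker_ids: list[int]) -> torch.Tensor:
    """Return a [seq_len] float mask that is 1.0 only for tokens after the marker.

    input_ids: 1-D tensor of token ids for one sample
               (document + question + marker + answer).
    marker_ids: token ids of the marker sequence (e.g. the tokenized "### Answer:").
    """
    seq_len = input_ids.size(0)
    loss_mask = torch.zeros(seq_len, dtype=torch.float)

    m = len(marker_ids)
    marker = torch.tensor(marker_ids, dtype=input_ids.dtype)
    # Find the last occurrence of the marker sequence; if it is not found,
    # the mask stays all zeros and the sample contributes no loss.
    for start in range(seq_len - m, -1, -1):
        if torch.equal(input_ids[start:start + m], marker):
            loss_mask[start + m:] = 1.0  # answer tokens only
            break
    return loss_mask


# Toy example: marker is [99, 100], answer is the two tokens after it.
ids = torch.tensor([5, 6, 7, 99, 100, 8, 9])
mask = completion_only_loss_mask(ids, [99, 100])
# mask -> tensor([0., 0., 0., 0., 0., 1., 1.])

# The masked loss would then be the usual loss-mask-weighted average:
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```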