About OOM During Training and Questions Regarding Attn #46

Zheng-Jay · 2024-07-02T14:02:08Z

Thank you for your contribution! I have encountered some issues.
1. Full training
Here is my training script:

CUDA_VISIBLE_DEVICES="0,5" torchrun --nproc_per_node 2 \
-m training.run \
--output_dir ./output/7-2_full \
--model_name_or_path /mnt/data1/zmj/embedding_model/GritLM-7B/model \
--train_data /mnt/data1/zmj/embedding_model/gritlm-main/gritlm/training/toy_data/test_7-2.jsonl \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normalized True \
--temperature 0.02 \
--query_max_len 32 \
--passage_max_len 128 \
--train_group_size 2 \
--mode embedding \
--attn cccc

Why do I get an OOM (Out of Memory) error? My GPU is 80G A800, and the model is only 7B with a batch size of 1. I believe this configuration should not cause an OOM.
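A back-of-the-envelope estimate may explain why a batch size of 1 does not help here. The sketch below assumes full fine-tuning with an Adam-style optimizer where weights, gradients, and both optimizer moments are held in fp32 (16 bytes per parameter), and that plain torchrun data parallelism keeps a full replica on every GPU. Those byte counts and the rounded parameter count are assumptions, not values taken from the GritLM code, and activations are not counted.

```python
# Rough estimate of per-GPU memory for full fine-tuning with an Adam-style optimizer.
# Assumptions (not taken from the GritLM code): fp32 weights, fp32 gradients,
# two fp32 optimizer moments per parameter, and a full model replica on each GPU.

def full_finetune_gib(n_params: float,
                      weight_bytes: int = 4,
                      grad_bytes: int = 4,
                      optim_bytes: int = 8) -> float:
    """Return the memory in GiB for weights, gradients, and optimizer state."""
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes
    return n_params * bytes_per_param / 2**30

N_PARAMS = 7e9  # "7B", rounded; the exact count differs slightly
needed = full_finetune_gib(N_PARAMS)
print(f"~{needed:.0f} GiB before activations, versus 80 GiB on one A800")
```

Under these assumptions the model state alone exceeds 80 GiB, so the batch size is not the limiting factor.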

2. LoRA training
To be able to train at all, I used the --lora option. However, the checkpoint saved after training is 24GB, while the original model was only 14GB:

27G     /mnt/data1/zmj/embedding_model/gritlm-main/gritlm/output/7-2_lora/checkpoint-800

I would like to know why this is the case. Additionally, I received the following warning when loading:
Some weights of the model checkpoint at /mnt/data1/zmj/embedding_model/gritlm-main/gritlm/output/7-2_lora were not used when initializing MistralForCausalLM: ['model.base_model.model.embed_tokens.weight', 'model.base_model.model.layers.0.input_layernorm.weight', 'model.base_model.model.layers.0.mlp.down_proj.weight', 'model.base_model.model.layers.0.mlp.gate_proj.weight',...]
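The key names in the warning start with model.base_model.model., which looks like a prefix that an adapter wrapper places around the base model. That reading is an inference from the warning text, not something confirmed in this thread. The sketch below only diagnoses the naming mismatch; the helper names are hypothetical, and it does not merge any LoRA adapter weights into the base model.

```python
# Diagnostic sketch: find checkpoint keys that carry a wrapper prefix and show
# what they would be called without it.
# Assumption: "base_model.model." is an extra infix inserted by an adapter wrapper
# (inferred from the warning text, not confirmed against the GritLM code).

WRAPPER_INFIX = "base_model.model."  # hypothetical constant for illustration

def unwrap_key(key: str) -> str:
    """Drop the first occurrence of the wrapper infix from a state-dict key."""
    return key.replace(WRAPPER_INFIX, "", 1)

def mismatched_keys(checkpoint_keys, expected_keys):
    """Return {checkpoint_key: unwrapped_key} for keys that only match after unwrapping."""
    expected = set(expected_keys)
    return {
        key: unwrap_key(key)
        for key in checkpoint_keys
        if key not in expected and unwrap_key(key) in expected
    }

# Checkpoint key names are copied from the warning; the expected names are
# assumed to follow the usual MistralForCausalLM layout.
checkpoint_keys = [
    "model.base_model.model.embed_tokens.weight",
    "model.base_model.model.layers.0.input_layernorm.weight",
]
expected_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
]
for old, new in mismatched_keys(checkpoint_keys, expected_keys).items():
    print(f"{old} -> {new}")
```

If every unused key matches an expected key once the infix is dropped, the checkpoint was probably saved from the wrapped model rather than from the underlying one.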

3. attn
After reading the paper, I understand that you used bidirectional attention when training for the embedding task. Why, then, does the example script you provided for the embedding task use --attn cccc?

I look forward to your response.

Muennighoff · 2024-07-02T16:39:31Z

1.
Try adding the flags below to save memory:

    --attn_implementation sdpa \
    --no_gen_gas \
    --no_emb_gas \
    --split_emb \

2.
I have not tried LoRA, but it looks to me like your checkpoint was saved in FP32, which doubles the size. The warning is a problem because it means your weights are not being loaded.
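The size difference can be checked with simple arithmetic. The sketch below assumes roughly 7.24e9 parameters, an approximate count for a Mistral-7B-sized model that is not stated in this thread, and compares 2 bytes per parameter (fp16/bf16) with 4 bytes per parameter (fp32).

```python
# Checkpoint size as a function of the stored dtype.
# Assumption: roughly 7.24e9 parameters (approximate, not stated in the thread).

N_PARAMS = 7.24e9
BYTES_PER_DTYPE = {"fp16/bf16": 2, "fp32": 4}

def checkpoint_gib(n_params: float, bytes_per_param: int) -> float:
    """Return the size in GiB of the raw weights at the given precision."""
    return n_params * bytes_per_param / 2**30

for dtype, nbytes in BYTES_PER_DTYPE.items():
    print(f"{dtype}: ~{checkpoint_gib(N_PARAMS, nbytes):.1f} GiB")
```

Under that assumption the weights take about 13.5 GiB in half precision and about 27 GiB in fp32, which is close to the original model size and to the 27G reported for the checkpoint directory.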

3.

It is just an example script; the actual training script is here: https://github.com/ContextualAI/gritlm/blob/main/scripts/training/train_gritlm_7b.sh
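For readers comparing --attn values, here is a minimal sketch of how a four-character value could be read. It assumes the positions stand for, in order, embedding instruction, embedding sample, generative instruction, and generative sample, with b meaning bidirectional and c meaning causal. That mapping and the helper name are assumptions for illustration; check the argument help in the repository before relying on them.

```python
# Sketch of how a four-character --attn value could be read.
# Assumption: positions are (embedding instruction, embedding sample,
# generative instruction, generative sample); 'b' = bidirectional, 'c' = causal.

POSITIONS = ("emb_instruction", "emb_sample", "gen_instruction", "gen_sample")
MODES = {"b": "bidirectional", "c": "causal"}

def describe_attn(attn: str) -> dict:
    """Map each position of the attn string to its attention mode."""
    if len(attn) != len(POSITIONS) or any(ch not in MODES for ch in attn):
        raise ValueError(f"expected four characters from 'b'/'c', got {attn!r}")
    return {pos: MODES[ch] for pos, ch in zip(POSITIONS, attn)}

print(describe_attn("cccc"))  # causal everywhere, as in the example script
print(describe_attn("bbcc"))  # bidirectional for embedding, causal for generation
```

Under this reading, cccc keeps causal attention throughout, while a value such as bbcc would correspond to the bidirectional embedding setup described in the paper.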

Zheng-Jay · 2024-07-03T14:22:13Z

Thank you very much for your reply. I'm not very familiar with training code, but I will give it a try. Thanks again!

GeraldWu23 · 2024-07-23T09:29:47Z

Did this solution work for you? I am also working on this.

GeraldWu23 · 2024-07-23T09:33:31Z

@Muennighoff Is it possible to load the model across different GPUs? I tried removing torchrun and running with python on multiple GPUs, but I got a "not on the same device" error: the input tensors were on cuda:0 while the model was on different CUDA devices.

Muennighoff · 2024-07-23T15:04:07Z

I recommend using torchrun for multiple GPUs. I haven't tested it without torchrun on multiple GPUs, but it should also work, perhaps after some small modifications.

GeraldWu23 · 2024-07-26T09:11:25Z

@Muennighoff I am currently working on fine-tuning the 7B model on multiple GPUs. The 7B model doesn't fit on one 80G GPU, so running data-parallel across GPUs as in your demo does not seem possible. I added device_map="auto" to use multiple GPUs, but I keep getting a "tensors on different devices" error. Do you have any idea about that, or any recommendation for fine-tuning the 7B model with n * 80G GPUs?

yhshu · 2024-09-02T18:41:00Z

I have the same problem: I cannot train with torchrun on 80G GPUs. I then tried device_map="auto", but it still does not work. Do you have any ideas about this? Thank you!
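One general alternative to device_map="auto" is sharded data parallelism (for example PyTorch FSDP with full sharding, or DeepSpeed ZeRO stage 3), where weights, gradients, and optimizer state are split across GPUs instead of replicated. Whether the GritLM training code supports this without changes is not confirmed in this thread. The sketch below only estimates the per-GPU model state under the assumption of 16 bytes per parameter and an even split; activations and communication buffers are not counted.

```python
# Per-GPU memory for model state when weights, gradients, and optimizer state
# are sharded evenly across GPUs (as in FSDP full sharding or ZeRO stage 3).
# Assumptions: 16 bytes per parameter (fp32 weights + grads + two Adam moments);
# activations and communication buffers are not counted.

def sharded_state_gib(n_params: float, n_gpus: int, bytes_per_param: int = 16) -> float:
    """Return the GiB of model state each GPU holds under an even split."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return n_params * bytes_per_param / n_gpus / 2**30

for n_gpus in (1, 2, 4, 8):
    print(f"{n_gpus} GPU(s): ~{sharded_state_gib(7e9, n_gpus):.0f} GiB of model state each")
```

Under these assumptions the state drops below 80 GiB per GPU once it is split across two or more devices, although the remaining headroom for activations grows with the number of GPUs.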
