
NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #237

Open · ZetangForward opened this issue Apr 15, 2024 · 1 comment

ZetangForward commented Apr 15, 2024

Hi, I want to train a small version of the RWKV-V5 169M model from scratch.
I implemented it with Hugging Face Transformers:

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RWKV/rwkv-4-169m-pile")
config = AutoConfig.from_pretrained("RWKV/rwkv-4-169m-pile")

tiny_rwkv_configs = {
    "num_hidden_layers": 4,
    "hidden_size": 256,
    "intermediate_size": 1024,
    "attention_hidden_size": 256,
    "vocab_size": 20480,
}

# apply the custom settings to the config,
# e.g. config.num_hidden_layers = tiny_rwkv_configs["num_hidden_layers"]
for key, value in tiny_rwkv_configs.items():
    setattr(config, key, value)

model = AutoModelForCausalLM.from_config(config)

# initialize dataloader, optimizer, etc.
for sample in dataloader:
    outputs = model(sample)
    loss = outputs.loss
    loss.backward()
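
For completeness, the dataloader/optimizer setup I elided looks roughly like the sketch below (the toy dataset, batch size, and learning rate are simplified placeholders, not my real values):

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy corpus: random token ids inside the custom vocab
input_ids = torch.randint(0, config.vocab_size, (64, 128))
dataloader = DataLoader(TensorDataset(input_ids), batch_size=8)

model = model.cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for (batch,) in dataloader:
    batch = batch.cuda()
    # passing labels makes the HF model compute the LM loss internally
    outputs = model(input_ids=batch, labels=batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()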

But when I call backward on the loss, I hit this error:

You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Sanity Checking: |                                                   | 0/? [00:00<?, ?it/s]
/nvme1/zecheng/modelzipper/projects/state-space-model/custom_dataset/AR_ywj.py:116: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  attention_mask = torch.tensor(attention_mask, dtype=torch.long)
(the same UserWarning is repeated several more times)
Sanity Checking DataLoader 0:   0%|                                  | 0/1 [00:00<?, ?it/s]
Epoch 0:   0%| | 3/1398 [00:00<02:55,  7.95it/s, v_num=tzc, train_lm_loss=nan.0, train_ppl=
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa88159617 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7faa8811498d in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7faa88215128 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7faa8914b250 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7faa8914f078 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7faa89165910 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7faa89165c18 in /home/amax/anaconda3/envs/zecheng/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xc819d (0x7faacd94619d in /home/amax/anaconda3/envs/zecheng/bin/../lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fab09939609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fab0985e353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(stack trace identical to the one above)

Worth noting that I'm training the model from scratch, with only a 4-layer RWKV and custom settings, and the loss becomes nan.
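
To localize the illegal memory access, I plan to rerun with synchronous kernel launches, a generic CUDA debugging step (not specific to this repo), so the Python stack points at the op that actually faults:

import os

# Must be set before torch initializes a CUDA context; every kernel launch
# then synchronizes, so the error surfaces at the offending op instead of
# asynchronously in the NCCL watchdog thread.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the variable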

Has anyone encountered this issue?
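
In the meantime, one check I can run (assuming the Hugging Face RWKV model falls back to its pure-PyTorch WKV implementation on CPU, as I believe it does) is the same forward/backward on CPU; if the loss is finite there, the custom CUDA kernel becomes the main suspect:

import torch

model_cpu = model.float().cpu()  # CPU path avoids the custom CUDA WKV kernel
ids = torch.randint(0, config.vocab_size, (2, 64))
out = model_cpu(input_ids=ids, labels=ids)
out.loss.backward()
print(out.loss)  # a finite loss here points the finger at the GPU kernel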

BlinkDL (Owner) commented Apr 16, 2024

Hi, it seems you still need to use the RWKV-LM repo to train it.
