Steps/Code to Reproduce Bug
Please provide minimal steps or a code snippet to reproduce the bug.
Dataset size: 749,000 samples.
Running NeMo Curator semantic deduplication on the allenai/tulu-3-sft-olmo-2-mixture dataset in preparation for fine-tuning.
Latest NVIDIA drivers installed.
2025-02-05 07:27:31,421 - distributed.worker - ERROR - Compute Failed
Key: ('lambda-619f7ac64f13a38ca6c6546e6af3af28', 10)
State: executing
Task: <Task ('lambda-619f7ac64f13a38ca6c6546e6af3af28', 10) reify(...)>
Exception: "OutOfMemoryError('CUDA out of memory. Tried to allocate 59.96 GiB. GPU 0 has a total capacity of 79.10 GiB of which 17.46 GiB is free. Process 363562 has 61.61 GiB memory in use. Of the allocated memory 60.08 GiB is allocated by PyTorch, and 284.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)')"
Traceback:
  File "/usr/local/lib/python3.10/dist-packages/dask/bag/core.py", line 1875, in reify
    seq = list(seq)
  File "/usr/local/lib/python3.10/dist-packages/dask/bag/core.py", line 2063, in __next__
    return self.f(*vals)
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/modules/semantic_dedup.py", line 524, in <lambda>
    lambda cluster_id: get_semantic_matches_per_cluster(
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/semdedup_utils.py", line 272, in get_semantic_matches_per_cluster
    M, M1 = _semdedup(cluster_reps, "cuda")
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/semdedup_utils.py", line 193, in _semdedup
    triu_sim_mat = torch.triu(pair_w_sim_matrix, diagonal=1)
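For context, a back-of-the-envelope estimate (not something stated in the log): the failure happens while materializing the full pairwise similarity matrix for a single cluster, so the allocation grows quadratically with the number of documents assigned to that cluster. Assuming a float32 N x N matrix, the 59.96 GiB request corresponds to a cluster of roughly 127k documents:

```python
# Rough estimate of the cluster size implied by the failed allocation,
# assuming the pairwise similarity matrix is float32 (4 bytes per entry).
bytes_requested = 59.96 * 1024**3   # "Tried to allocate 59.96 GiB"
n_entries = bytes_requested / 4     # float32 entries in the N x N matrix
n_docs = int(n_entries ** 0.5)      # N, documents assigned to this cluster
print(n_docs)                       # ~126,900
```

That would mean a sizable fraction of the 749,000-sample dataset landed in one cluster, which suggests the single-cluster matrix is too large for one GPU no matter how many GPUs are available overall.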
#####################
How do I launch this script across multiple GPUs to avoid the CUDA out-of-memory error?
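One approach worth trying, as a sketch under assumptions rather than an official NeMo Curator recipe: start a dask-cuda cluster spanning all GPUs before running the semantic-dedup step, and set the allocator flag the error message itself suggests. `run_semantic_dedup` below is a hypothetical placeholder for the script's actual entry point.

```python
import os

# Suggested by the OOM message itself; inherited by the dask-cuda worker
# processes spawned after this point.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

if __name__ == "__main__":
    # One Dask worker per listed GPU.
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,3")
    client = Client(cluster)

    # Hypothetical: hand the client (or cluster.scheduler_address) to the
    # existing semantic-dedup script so its tasks are spread across all
    # workers instead of a single GPU.
    run_semantic_dedup(client)  # placeholder for your entry point
```

Note that each cluster's similarity matrix is still built on a single GPU, so adding workers alone may not be enough; if your version of the semantic-dedup config exposes the number of k-means clusters, increasing it should shrink each per-cluster matrix, and the `expandable_segments` flag addresses the fragmentation the error message mentions.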