Running the add ID module of Curator runs into OOMs even with a small batch size, e.g., 32.
The dataset for adding IDs is a single snapshot of the Red Pajama v2 dataset, which is about 4 TB in size.
The job was run on 10 CPU nodes. Each CPU node has 96 cores and 176 GB of memory.
Jun 25 13:48:18.323459 942129 slurmstepd 0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd 0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd 0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0
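To make the log easier to triage, here is a minimal Python sketch that tallies the reported OOM kills and lists the nodes that srun flagged as out of memory. It only parses the lines quoted in this report. The helper name `summarize_oom` is hypothetical, and summing the counts assumes each slurmstepd line reports a distinct set of events, which the log does not confirm.

```python
import re

# Log lines copied verbatim from this report.
LOG = """\
Jun 25 13:48:18.323459 942129 slurmstepd 0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd 0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd 0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0
"""

# "Detected N oom_kill event(s) in StepId=<job>.<step>"
OOM_EVENTS = re.compile(r"Detected (\d+) oom_kill events? in StepId=(\d+\.\d+)")
# "srun: error: <node>: task <id>: Out Of Memory"
OOM_TASK = re.compile(r"srun: error: (\S+): task (\d+): Out Of Memory")


def summarize_oom(log: str):
    """Return (total oom_kill events, nodes srun reported as Out Of Memory)."""
    # Assumption: each slurmstepd line counts different events, so they can be summed.
    total = sum(int(m.group(1)) for m in OOM_EVENTS.finditer(log))
    nodes = [m.group(1) for m in OOM_TASK.finditer(log)]
    return total, nodes


total_events, oom_nodes = summarize_oom(LOG)
print(f"oom_kill events: {total_events}, OOM nodes: {oom_nodes}")
```

Note that srun names only one node as Out Of Memory; the other tasks are reported as Terminated, which is consistent with the step being torn down after the first OOM.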
Some observations:
Memory usage is extremely unbalanced across the nodes.
node        total  used  free  shared  buff/cache  available
cpu-00009     176    14   144       0          18        160
cpu-00042     176    79    88       0           8         94
cpu-00046     176   113    61       0           2         61
cpu-00082     176    13   145       0          17        160
cpu-00050     176    74    78       0          23         99
cpu-00019     176    72    38       0          65        101
cpu-00087     176    55   106       0          15        119
cpu-00086     176    90    80       0           6         84
cpu-00020     176    36   101       0          39        138
cpu-00002     176   156     2       0          17         18
Some nodes have very little memory left while others show almost no memory usage (available memory, in GB, is shown in the last column).
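To put a number on the imbalance, here is a small Python sketch over the per-node figures reported here. The used and available values are copied from the report. The helper name `imbalance` and the 20 GB "nearly full" threshold are illustrative assumptions, not anything Curator or Slurm defines.

```python
from statistics import mean

# (node, used_gb, available_gb), copied from the per-node memory report.
NODES = [
    ("cpu-00009", 14, 160),
    ("cpu-00042", 79, 94),
    ("cpu-00046", 113, 61),
    ("cpu-00082", 13, 160),
    ("cpu-00050", 74, 99),
    ("cpu-00019", 72, 101),
    ("cpu-00087", 55, 119),
    ("cpu-00086", 90, 84),
    ("cpu-00020", 36, 138),
    ("cpu-00002", 156, 18),
]


def imbalance(nodes, low_available_gb=20):
    """Summarize how unevenly memory is used across nodes.

    low_available_gb is an arbitrary illustrative threshold for "nearly full".
    """
    used = [u for _, u, _ in nodes]
    avg = mean(used)
    return {
        "min_used": min(used),
        "max_used": max(used),
        "mean_used": avg,
        # Ratio of the busiest node to the average; 1.0 would be perfectly even.
        "max_over_mean": max(used) / avg,
        "nearly_full": [n for n, _, a in nodes if a < low_available_gb],
    }


stats = imbalance(NODES)
print(stats)
```

On these figures the busiest node uses more than twice the average, and the least loaded nodes sit at 13 to 14 GB used while cpu-00002 has only 18 GB available.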
Describe the bug
Additional observations:
CPU utilization is very low.
Setting the start-index argument slows down the code.
IO speed decreases over time.
Steps/Code to reproduce bug