
Running into OOM with add id #142

Open
yyu22 opened this issue Jul 8, 2024 · 3 comments

yyu22 (Contributor) commented Jul 8, 2024

Describe the bug

Running the add ID module of Curator runs into OOMs even with a small batch size, e.g., 32.
The dataset being processed is a single snapshot of the Red Pajama v2 dataset, which is about 4 TB in size.
The job was run on 10 CPU nodes; each node has 96 cores and 176 GB of memory.

Jun 25 13:48:18.323459 942129 slurmstepd   0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd   0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd   0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0

Some observations:

  • Memory usage is extremely unbalanced across the nodes (values below are in GB):
cpu-00009              total        used        free      shared  buff/cache   available
Mem:            176          14         144           0          18         160
cpu-00042              total        used        free      shared  buff/cache   available
Mem:            176          79          88           0           8          94
cpu-00046              total        used        free      shared  buff/cache   available
Mem:            176         113          61           0           2          61
cpu-00082              total        used        free      shared  buff/cache   available
Mem:            176          13         145           0          17         160
cpu-00050              total        used        free      shared  buff/cache   available
Mem:            176          74          78           0          23          99
cpu-00019              total        used        free      shared  buff/cache   available
Mem:            176          72          38           0          65         101
cpu-00087              total        used        free      shared  buff/cache   available
Mem:            176          55         106           0          15         119
cpu-00086              total        used        free      shared  buff/cache   available
Mem:            176          90          80           0           6          84
cpu-00020              total        used        free      shared  buff/cache   available
Mem:            176          36         101           0          39         138
cpu-00002              total        used        free      shared  buff/cache   available
Mem:            176         156           2           0          17          18
  • Some nodes have very little memory left, while others show almost no memory usage (available memory is the last column):
$ grep -A 1 'cpu-00002' ./log.txt  | grep 'Mem:'
Mem:            176          85          64           0          26          88
Mem:            176         171           3           0           1           3
Mem:            176         141          28           0           6          32
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60

$ grep -A 1 'cpu-00082' ./log.txt  | grep 'Mem:'
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160

  • CPU utilization is very low.

  • Setting the start-index argument slows down the code (see the sketch after the reproduction code below).

  • IO speed decreases over time.

Steps/Code to reproduce bug

    from nemo_curator import AddId
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_batched_files

    # data_path: input jsonl directory, id_data_path: output directory
    batch_index = 0
    for files in get_batched_files(data_path, id_data_path, "jsonl", batch_size=128):
        dataset = DocumentDataset.read_json(files, add_filename=True)
        print("Done reading dataset")
        add_id = AddId(
            id_field="id",
            id_prefix=f"rpv2-{batch_index}",
        )
        print("Start adding id")
        id_dataset = add_id(dataset)
        print("Done adding id")
        id_dataset.to_json(id_data_path, write_to_filename=True)
        batch_index += 1
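
For reference, a minimal sketch of the start-index variant mentioned in the observations above. The start_index keyword is assumed to match the AddId constructor; only the AddId construction inside the loop changes.

    # Hedged sketch: same loop body as above, but passing an explicit
    # start_index so ids are assigned as consecutive integers. Consecutive
    # ids generally require knowing partition sizes up front, which is
    # why this path tends to be slower than leaving start_index unset.
    add_id = AddId(
        id_field="id",
        id_prefix=f"rpv2-{batch_index}",
        start_index=0,  # assumed keyword; 0 is an illustrative value
    )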
@yyu22 added the bug label on Jul 8, 2024
@ryantwolf self-assigned this on Jul 22, 2024

ryantwolf (Collaborator) commented:

The original OOM error is due to not properly limiting the number of workers based on memory.
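
As a rough illustration only (not Curator's actual startup code), this is what capping workers by memory could look like with a plain dask.distributed LocalCluster; the worker count and per-worker limit are made-up numbers sized against the 176 GB nodes described above:

    from dask.distributed import Client, LocalCluster

    # Illustrative sizing: with 176 GB per node, 16 workers capped at 10 GB
    # each keeps the worst case under the node's RAM, whereas one worker per
    # core (96) would leave well under 2 GB per worker.
    cluster = LocalCluster(
        n_workers=16,
        threads_per_worker=1,
        memory_limit="10GB",  # per-worker cap; Dask spills/pauses near this limit
    )
    client = Client(cluster)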

> Memory usage is extremely unbalanced across the nodes

I'm not sure about this one, but if I had to guess, the batch size you've created is too small to use all of the available workers.

> CPU utilization is very low.

Yes, add_id in general does not use much CPU and is heavily IO-bound.

> Setting the start-index argument slows down the code

This is expected.

> IO speed decreases over time

Not sure about this one.

glam621 (Collaborator) commented Aug 19, 2024

Follow up with @yyu22 and @ryantwolf

sarahyurick (Collaborator) commented:

Could this be fixed by #479?
