
Running into OOM with add id #142

Open
yyu22 opened this issue Jul 8, 2024 · 3 comments

yyu22 (Contributor) commented Jul 8, 2024

Describe the bug

Running the add ID module of Curator runs into OOMs even with a small batch size, e.g., 32.
The dataset being processed is a single snapshot of the Red Pajama v2 dataset, which is about 4 TB in size.
The job was run on 10 CPU nodes; each node has 96 cores and 176 GB of memory.

Jun 25 13:48:18.323459 942129 slurmstepd   0x155552de2d40: error: Detected 7 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00046: task 5: Out Of Memory
srun: Terminating StepId=1127164.0
Jun 25 13:48:19.899767 2590557 slurmstepd   0x155552de2d40: error: Detected 1 oom_kill event in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: error: cpu-00017: task 2: Terminated
srun: error: cpu-00038: task 3: Terminated
srun: error: cpu-00050: task 6: Terminated
Jun 25 13:48:20.991455 2567860 slurmstepd   0x155552de2d40: error: Detected 2 oom_kill events in StepId=1127164.0. Some of the step tasks have been OOM Killed.
srun: Force Terminated StepId=1127164.0

Some observations:

  • Memory usage is extremely unbalanced across the nodes (values below are in GB):
cpu-00009              total        used        free      shared  buff/cache   available
Mem:            176          14         144           0          18         160
cpu-00042              total        used        free      shared  buff/cache   available
Mem:            176          79          88           0           8          94
cpu-00046              total        used        free      shared  buff/cache   available
Mem:            176         113          61           0           2          61
cpu-00082              total        used        free      shared  buff/cache   available
Mem:            176          13         145           0          17         160
cpu-00050              total        used        free      shared  buff/cache   available
Mem:            176          74          78           0          23          99
cpu-00019              total        used        free      shared  buff/cache   available
Mem:            176          72          38           0          65         101
cpu-00087              total        used        free      shared  buff/cache   available
Mem:            176          55         106           0          15         119
cpu-00086              total        used        free      shared  buff/cache   available
Mem:            176          90          80           0           6          84
cpu-00020              total        used        free      shared  buff/cache   available
Mem:            176          36         101           0          39         138
cpu-00002              total        used        free      shared  buff/cache   available
Mem:            176         156           2           0          17          18
  • Some nodes have very little memory left, while others show almost no memory usage (available memory is the last column):
$ grep -A 1 'cpu-00002' ./log.txt  | grep 'Mem:'
Mem:            176          85          64           0          26          88
Mem:            176         171           3           0           1           3
Mem:            176         141          28           0           6          32
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60
Mem:            176         113          45           0          17          60

$ grep -A 1 'cpu-00082' ./log.txt  | grep 'Mem:'
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         145           0          17         160
Mem:            176          13         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160
Mem:            176          14         144           0          17         160

  • CPU utilization is very low.

  • Setting the start-index argument slows down the code (see the sketch after the reproduction code below).

  • IO speed decreases over time.

Steps/Code to reproduce bug

    from nemo_curator import AddId
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.file_utils import get_batched_files

    # data_path: input jsonl directory, id_data_path: output directory
    batch_index = 0
    for files in get_batched_files(data_path, id_data_path, "jsonl", batch_size=128):
        dataset = DocumentDataset.read_json(files, add_filename=True)
        print("Done reading dataset")
        add_id = AddId(
            id_field="id",
            id_prefix=f"rpv2-{batch_index}",
        )
        print("Start adding id")
        id_dataset = add_id(dataset)
        print("Done adding id")
        id_dataset.to_json(id_data_path, write_to_filename=True)
        batch_index += 1
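
For reference, a minimal sketch of the start-index variant mentioned in the observations above. The start_index keyword is assumed to match the AddId constructor; only the AddId construction inside the loop changes.

    # Hedged sketch: same loop body as above, but passing an explicit
    # start_index so ids are assigned as consecutive integers. Consecutive
    # ids generally require knowing partition sizes up front, which is
    # why this path tends to be slower than leaving start_index unset.
    add_id = AddId(
        id_field="id",
        id_prefix=f"rpv2-{batch_index}",
        start_index=0,  # assumed keyword; 0 is an illustrative value
    )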
@yyu22 added the bug label on Jul 8, 2024
@ryantwolf self-assigned this on Jul 22, 2024

ryantwolf (Collaborator) commented:

The original OOM error is due to not properly limiting the number of workers based on memory.
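
As a rough illustration only (not Curator's actual startup code), this is what capping workers by memory could look like with a plain dask.distributed LocalCluster; the worker count and per-worker limit are made-up numbers sized against the 176 GB nodes described above:

    from dask.distributed import Client, LocalCluster

    # Illustrative sizing: with 176 GB per node, 16 workers capped at 10 GB
    # each keeps the worst case under the node's RAM, whereas one worker per
    # core (96) would leave well under 2 GB per worker.
    cluster = LocalCluster(
        n_workers=16,
        threads_per_worker=1,
        memory_limit="10GB",  # per-worker cap; Dask spills/pauses near this limit
    )
    client = Client(cluster)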

> Memory usage is extremely unbalanced across the nodes

I'm not sure about this one, but if I had to guess, the batch size you've created is too small to use all of the available workers.

> CPU utilization is very low.

Yes, add_id in general does not use much CPU and is heavily IO-bound.

> Setting the start-index argument slows down the code

This is expected.

> IO speed decreases over time

Not sure about this one.

glam621 (Collaborator) commented Aug 19, 2024

Follow up with @yyu22 and @ryantwolf

sarahyurick (Collaborator) commented:

Could this be fixed by #479?
