Describe the bug
I used 5 A100 GPUs to run the fuzzy_dedup task and encountered OOM issues. Here is the error info:
2024-12-31 05:02:43,370 - distributed.worker - ERROR - Could not serialize object of type DataFrame
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/distributed/protocol/serialize.py", line 366, in serialize
header, frames = dumps(x, context=context) if wants_context else dumps(x)
File "/usr/local/lib/python3.10/dist-packages/distributed/protocol/serialize.py", line 52, in dask_dumps
sub_header, frames = dumps(x)
File "/usr/local/lib/python3.10/dist-packages/cudf/comm/serialize.py", line 19, in dask_serialize_cudf_object
return x.host_serialize()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/abc.py", line 150, in host_serialize
header, frames = self.device_serialize()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/abc.py", line 90, in device_serialize
header, frames = self.serialize()
File "/usr/local/lib/python3.10/dist-packages/cudf/core/dataframe.py", line 1171, in serialize
header, frames = super().serialize()
File "/usr/local/lib/python3.10/dist-packages/cudf/utils/performance_tracking.py", line 51, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/frame.py", line 100, in serialize
header["columns"], frames = serialize_columns(self._columns)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 2279, in serialize_columns
header_columns = [c.serialize() for c in columns]
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 2279, in <listcomp>
header_columns = [c.serialize() for c in columns]
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 1226, in serialize
if self.children:
File "column.pyx", line 293, in cudf._lib.column.Column.children.__get__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-fkfud_57/normal/lib/python3.10/site-packages/librmm/include/rmm/mr/device/cuda_memory_resource.hpp:60: cudaErrorMemoryAllocation out of memory
Thanks for raising the issue, @chenrui17.
For 8TB of input data on 5 A100 GPUs (~400GB of total GPU memory), the memory required to hold intermediates during stages like LSH can lead to OOMs.
I have a few recommendations to reduce the memory and computational requirements at this scale:
char_ngrams=24,  # use a larger ngram size to reduce false positives
buckets_per_shuffle=1,  # process 1 bucket per iteration of LSH to reduce memory requirements
# skip the false positive check, which is computationally expensive;
# in practice this affects only ~1-2% of documents based on our experiments
false_positive_check=False,
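For concreteness, here is a minimal sketch of how these settings could be plugged into NeMo Curator's fuzzy dedup. Only char_ngrams, buckets_per_shuffle, and false_positive_check come from the suggestion above; the surrounding FuzzyDuplicatesConfig / FuzzyDuplicates / DocumentDataset usage and the remaining fields reflect my reading of the Curator API and may differ by version, and all paths and field names are placeholders.

```python
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# Placeholder paths and field names; adjust to your setup.
# Only the three commented parameters below are from the suggestion above;
# the other fields are shown for context and may vary across Curator versions.
fuzzy_dedup_config = FuzzyDuplicatesConfig(
    cache_dir="/path/to/fuzzy_dedup_cache",
    id_field="nemo_id",
    text_field="text",
    char_ngrams=24,              # larger ngram size -> fewer false positives
    buckets_per_shuffle=1,       # one LSH bucket per shuffle -> lower peak memory
    false_positive_check=False,  # skip the expensive false positive check
)

# Read the parquet dataset onto the GPU workers and run fuzzy dedup.
input_dataset = DocumentDataset.read_parquet("/path/to/input_parquet", backend="cudf")
fuzzy_dup = FuzzyDuplicates(config=fuzzy_dedup_config)
duplicates = fuzzy_dup(dataset=input_dataset)
```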
Some of the changes suggested above are becoming the default in Curator (see #386).
Additionally, I would recommend keeping parquet files <= 2GB uncompressed if you have large files. If you are using many small files, you can pass the blocksize="1GB" argument to read_parquet.
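As a rough sketch of the small-files case (assuming a recent dask_cudf build whose read_parquet accepts a blocksize argument; the glob path is a placeholder):

```python
import dask_cudf

# Placeholder glob; point this at your parquet dataset.
# blocksize aggregates small files / row groups so each partition is roughly 1GB,
# which keeps the partition count (and per-partition overhead) manageable.
ddf = dask_cudf.read_parquet(
    "/path/to/dclm_baseline/*.parquet",
    blocksize="1GB",
)
print(ddf.npartitions)
```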
Internally, we've typically used 16-24 GPUs for processing data at this scale, so I'm not sure these suggestions will prevent OOM errors on 5 GPUs, but I'm happy to follow up and see if this improves things.
Steps/Code to reproduce bug
Environment overview (please complete the following information)
Additional context
Using dclm-baseline 1.0 parquet data, 8TB of parquet data in total (after adding nemo_id, with no compression).
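For context on the nemo_id step, here is a hedged sketch of how an id column is typically added with Curator's AddId module; the AddId(id_field=..., id_prefix=...) usage is my assumption about the API, and the paths and prefix are placeholders.

```python
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset

# Placeholder paths; id_prefix is arbitrary.
dataset = DocumentDataset.read_parquet("/path/to/dclm_baseline_parquet")
add_id = AddId(id_field="nemo_id", id_prefix="dclm")
dataset_with_id = add_id(dataset)
dataset_with_id.to_parquet("/path/to/output_with_id")
```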