
[BUG] Reading parquet dataset on GPU throws "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error #1873

Open
orlev2 opened this issue Feb 13, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

orlev2 commented Feb 13, 2024

Describe the bug
Reading a parquet dataset on GPU throws a "cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']" error. Reading the same data on CPU succeeds:

distributed.worker - WARNING - Compute Failed
Key:       _sample_row_group-1d48e09b-5b56-4e62-92ec-860ff2f9dd40
Function:  execute_task
args:      ((<function apply at 0x7f6b6e47f010>, <function _sample_row_group at 0x7f692006f130>, ['path/to/parquet_files/000000000000.parquet', <gcsfs.core.GCSFileSystem object at 0x7f6ae65ec6d0>], (<class 'dict'>, [['cpu', False], ['memory_usage', True]])))
kwargs:    {}
Exception: 'ValueError("cudf engine doesn\'t support the following keyword arguments: [\'strings_to_categorical\']")'
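For context, the failure presumably comes from NVTabular's row-group sampling forwarding a legacy keyword through to cudf.read_parquet, which newer cudf releases reject. A minimal sketch that should hit the same error (the path is a placeholder):

import cudf

# _sample_row_group passes engine kwargs straight through to cudf;
# `strings_to_categorical` is no longer accepted by newer cudf, so this
# presumably raises the ValueError quoted above.
cudf.read_parquet(
    'path/to/parquet_files/000000000000.parquet',
    strings_to_categorical=False,
)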

Steps/Code to reproduce bug

import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=False)
# Fails with ValueError: cudf engine doesn't support the following keyword arguments: ['strings_to_categorical']

dataset_nvt = nvt.Dataset(dataset, engine='parquet', cpu=True)
# Runs successfully

Expected behavior
The dataset should be read from file under both cpu=True and cpu=False.

Environment details (please complete the following information):

  • Environment location: GCP Vertex AI notebook (GPU: NVIDIA V100 x 1)
  • Method of NVTabular install: conda

nvtabular == 23.8.00
cudf == 23.10.02 (the above error was also present under 23.12.01)
dask == 2023.9.2

@niraj06

orlev2 added the bug (Something isn't working) label on Feb 13, 2024
orlev2 (Author) commented Feb 14, 2024

The following workaround succeeds in loading the data:

import dask_cudf
import nvtabular as nvt

dataset = 'path/to/parquet_files/*.parquet'
dataset_nvt = nvt.Dataset(
    dask_cudf.read_parquet(dataset), engine='parquet', cpu=False
)
# <merlin.io.dataset.Dataset at 0x7f99bc042230>

However, applying the workflow transform fails:

workflow = nvt.Workflow.load("path/to/workflow")
workflow.transform(dataset_nvt)
# Exception: "TypeError('String Arrays is not yet implemented in cudf')"

Full error:

Key:       ('transform-bdc9b5878b9eff9e4e8eb287f652e68a', 63)
Function:  subgraph_callable-6a50eb3e-1830-40d8-bff7-0a6db4e7
args:      ([<Node SelectionOp>], 'read-parquet-070e46c56ae3f13e04d07d8cae7b3f14', {'piece': ('path/to/parquet_files/000000000000.parquet', None, None)})
kwargs:    {}
Exception: "TypeError('String Arrays is not yet implemented in cudf')"

The workflow includes nvt.ops.Categorify and nvt.ops.Groupby operations to create a string array of sequential events per grouped entity.
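For reference, a minimal sketch of that kind of workflow (the column names and aggregations here are illustrative guesses, not the actual workflow):

import nvtabular as nvt

# Hypothetical schema: per-user event logs.
cat_events = ['event_name'] >> nvt.ops.Categorify()
seqs = (cat_events + ['user_id', 'timestamp']) >> nvt.ops.Groupby(
    groupby_cols=['user_id'],       # one output row per entity
    sort_cols=['timestamp'],        # order events within each group
    aggs={'event_name': ['list']},  # sequential events as a list column
)
workflow = nvt.Workflow(seqs)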

rjzamora (Collaborator) commented

Sorry for this ridiculously late response @orlev2 - Just coming across this now.

As far as I can tell, the rapids/dask pinning in Merlin has been far too loose. NVTabular 23.8 was definitely not tested with cudf>=23.08 or dask>=2023.8.

The merlin 23.08 containers use cudf-23.04 (which uses dask-2023.1.1), so using that is your best bet.

NOTE: The lack of upper pinning in NVTabular is indeed a "bug" of sorts - I apologize for that.
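If it helps, a quick sanity check that an environment matches the tested combination (the expected versions below are taken from the container pairing mentioned above):

import cudf
import dask
import nvtabular

# Expected per the merlin 23.08 container pairing:
#   nvtabular 23.08.x, cudf 23.04.x, dask 2023.1.1
print("nvtabular:", nvtabular.__version__)
print("cudf:     ", cudf.__version__)
print("dask:     ", dask.__version__)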
