Skip to content

Simplify to_dask_cudf for compatibility with newer versions of dask #6496

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

TomAugspurger
Copy link
Contributor

With newer versions of Dask, some tests like cuml/python/cuml/cuml/tests/dask/test_dask_tsvd.py::test_pca_fit[dataframe-data_info0] are failing with a message like

distributed.client.FuturesCancelledError: 30 Futures cancelled:
Reasons:
30 Futures cancelled for reason: lost dependencies.

This PR simplifies the code used in the tests that converts from a Dask Array to a Dask DataFrame. It will hopefully be compatible the version of dask currently used in CI (2025.2.0, I believe) but I'm relying on CI to check that. I've tested that it works with Dask main.

@TomAugspurger
Copy link
Contributor Author

I wasn't sure whether to target 25.06 or 25.04.

@jcrist jcrist added bug Something isn't working non-breaking Non-breaking change labels Mar 31, 2025
@csadorf
Copy link
Contributor

csadorf commented Mar 31, 2025

I wasn't sure whether to target 25.06 or 25.04.

Considering that the rapids-dask-dependency for 25.04 depends on the pinned dask version 2025.02.0 (https://github.com/rapidsai/rapids-dask-dependency/blob/branch-25.04/conda/recipes/rapids-dask-dependency/meta.yaml), I see no need to change this in version 25.04 and so we should target 25.06. Do you know when we will bump our dask dependency?

@TomAugspurger
Copy link
Contributor Author

Do you know when we will bump our dask dependency?

I'm expecting to update the dependency for rapids 25.06, after the next Dask release (in the next couple of weeks).

@TomAugspurger
Copy link
Contributor Author

In 898bf6f I've updated the test to avoid the failure at https://github.com/rapidsai/cuml/actions/runs/14173228016/job/39704398544?pr=6496#step:9:3580

FAILED test_dask_kneighbors_classifier.py::test_predict_proba[dataset0-parameters0-dask_cudf] - assert (900, 5) == (900, 5, 1)
  Right contains one more item: 1
  Full diff:
  - (900, 5, 1)
  ?        ---
  + (900, 5)

I'm looking a bit more, but this is indicating an issue. We seem to be getting back a pandas DataFrame from da.compute(d_probas) where we previously maybe got cudf DataFrames. I'll look into this now.

@TomAugspurger
Copy link
Contributor Author

Actually, maybe this is expected. The return type of predict_proba generally seems to match the input type. And for dask inputs, the concrete type (pandas or cudf) is preserved.

# pandas -> pandas
(Pdb) type(l_model.predict_proba(X_test.compute())[0])
<class 'pandas.core.frame.DataFrame'>
# cudf -> cudf
(Pdb) type(l_model.predict_proba(cudf.from_pandas(X_test.compute()))[0])
<class 'cudf.core.dataframe.DataFrame'>
# numpy -> numpy
(Pdb) type(l_model.predict_proba(X_test.compute().to_numpy())[0])
<class 'numpy.ndarray'>
# cudf -> cupy
(Pdb) type(l_model.predict_proba(cudf.from_pandas(X_test.compute()).values)[0])
<class 'cupy.ndarray'>

# DaskDataFrame[pandas] -> pandas
(Pdb) type(d_model.predict_proba(X_test)[0].compute())
<class 'pandas.core.frame.DataFrame'>

# DaskDataFrame[cudf] -> cudf
(Pdb) type(d_model.predict_proba(X_test.map_partitions(cudf.from_pandas))[0].compute())
<class 'cudf.core.dataframe.DataFrame'>

So maybe we're OK. But I'd appreciate a close review of that change, and a confirmation that we want to be passing host inputs (numpy / pandas) rather than device inputs (cupy / cudf).

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 2, 2025

Has anyone seen https://github.com/rapidsai/cuml/actions/runs/14200165809/job/39787356294?pr=6496#step:9:3056
in test_accuracy_score before?

  File "/opt/conda/envs/test/lib/python3.12/site-packages/cudf/core/dtypes.py", line 69, in dtype
    raise ValueError(
ValueError: cudf does not support object dtype. Use 'str' instead.

I can't reproduce that locally, but I haven't tried matching versions exactly yet.

@TomAugspurger
Copy link
Contributor Author

Closing in favor of #6614.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Cython / Python Cython or Python issue non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants