Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mean fails to compute for very large column of pyarrow type #11113

Open
camposandro opened this issue May 9, 2024 · 1 comment
Open

Mean fails to compute for very large column of pyarrow type #11113

camposandro opened this issue May 9, 2024 · 1 comment

Comments

@camposandro
Copy link

It's not possible to calculate the grouped mean for a very large column of pyarrow type.

import pandas as pd
import numpy as np
import pyarrow as pa

import dask.dataframe
import dask.distributed

df = pd.DataFrame({
    # A series of ones that is larger than the maximum supported by uint32
    'a': pd.Series(np.ones(1 << 32), dtype=pd.ArrowDtype(pa.uint8())),
    # A distribution of values for which to compute the mean for
    'b': pd.Series(np.linspace(0, 1, 1 << 32), dtype=pd.ArrowDtype(pa.float32())),
})

It fails to compute with the legacy DataFrame:

with dask.distributed.Client():
    dask.dataframe.io.from_pandas(df, npartitions=5).groupby('a').b.mean().compute()
2024-05-09 08:33:07,340 - distributed.worker - WARNING - Compute Failed
Key:       ('truediv-8d00596f7743a7ba07a11eb25b3109ae', 0)
Function:  subgraph_callable-f24250c276b80c2c50dfa1c2094ebf85
args:      (a
1    2.147484e+09
Name: b, dtype: float[pyarrow], a
1    4294967296
Name: b, dtype: int64)
kwargs:    {}
Exception: "ArrowInvalid('Integer value 4294967296 not in range: -16777216 to 16777216')"

I tried using the latest DataFrame API but this operation does not seem to be supported yet:

with dask.distributed.Client():
    dask.dataframe.from_pandas(df).groupby('a').b.mean().compute()

NotImplementedError: <class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'> does not support reshape as backed by a 1D pyarrow.ChunkedArray.

Pandas seems to execute this workflow without issues:

df.groupby('a').b.mean()
a
1    0.5
Name: b, dtype: float[pyarrow]

Environment:

  • Dask version: 2024.5.0
  • Python version: 3.10.13
  • Operating System: CentOS Linux 7
  • Install method (conda, pip, source): pip
@github-actions github-actions bot added the needs triage Needs a response from a contributor label May 9, 2024
@phofl
Copy link
Collaborator

phofl commented May 13, 2024

Thanks for your report. For context: we are basically using

df.groupby('a').b.sum() / df.groupby('a').b.count()

under the hood to compute the mean, which also fails. I agree this is not great, I'll check if and how we can address this

@phofl phofl added dataframe and removed needs triage Needs a response from a contributor labels May 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants