Mean fails to compute for very large column of pyarrow type #11113

camposandro · 2024-05-09T15:52:14Z

It's not possible to calculate the grouped mean for a very large column of pyarrow type.

import pandas as pd
import numpy as np
import pyarrow as pa

import dask.dataframe
import dask.distributed

df = pd.DataFrame({
    # A series of ones that is larger than the maximum supported by uint32
    'a': pd.Series(np.ones(1 << 32), dtype=pd.ArrowDtype(pa.uint8())),
    # A distribution of values for which to compute the mean for
    'b': pd.Series(np.linspace(0, 1, 1 << 32), dtype=pd.ArrowDtype(pa.float32())),
})

It fails to compute with the legacy DataFrame:

with dask.distributed.Client():
    dask.dataframe.io.from_pandas(df, npartitions=5).groupby('a').b.mean().compute()

2024-05-09 08:33:07,340 - distributed.worker - WARNING - Compute Failed
Key:       ('truediv-8d00596f7743a7ba07a11eb25b3109ae', 0)
Function:  subgraph_callable-f24250c276b80c2c50dfa1c2094ebf85
args:      (a
1    2.147484e+09
Name: b, dtype: float[pyarrow], a
1    4294967296
Name: b, dtype: int64)
kwargs:    {}
Exception: "ArrowInvalid('Integer value 4294967296 not in range: -16777216 to 16777216')"

I tried using the latest DataFrame API but this operation does not seem to be supported yet:

with dask.distributed.Client():
    dask.dataframe.from_pandas(df).groupby('a').b.mean().compute()

NotImplementedError: <class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'> does not support reshape as backed by a 1D pyarrow.ChunkedArray.

Pandas seems to execute this workflow without issues:

df.groupby('a').b.mean()

a
1    0.5
Name: b, dtype: float[pyarrow]

Environment:

Dask version: 2024.5.0
Python version: 3.10.13
Operating System: CentOS Linux 7
Install method (conda, pip, source): pip

The text was updated successfully, but these errors were encountered:

phofl · 2024-05-13T22:00:22Z

Thanks for your report. For context: we are basically using

df.groupby('a').b.sum() / df.groupby('a').b.count()

under the hood to compute the mean, which also fails. I agree this is not great, I'll check if and how we can address this

github-actions bot added the needs triage Needs a response from a contributor label May 9, 2024

phofl added dataframe and removed needs triage Needs a response from a contributor labels May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mean fails to compute for very large column of pyarrow type #11113

Mean fails to compute for very large column of pyarrow type #11113

camposandro commented May 9, 2024

phofl commented May 13, 2024

Mean fails to compute for very large column of pyarrow type #11113

Mean fails to compute for very large column of pyarrow type #11113

Comments

camposandro commented May 9, 2024

phofl commented May 13, 2024