aggregate function that operates on vector(array of numeric) data #11119

Open
Rhett-Ying opened this issue May 14, 2024 · 0 comments
Labels
needs triage Needs a response from a contributor

Comments

@Rhett-Ying

I am wondering whether dask or pandas has native (built-in) support for aggregate functions that run against vector data. Namely, text/image embeddings are stored in a column of a CSV/Parquet file, and I'd like to run various aggregate functions on them, such as mean, max and so on. All of these operations are element-wise: for example, mean returns the mean of all the values at the same index, producing an array with the same length as the inputs. I'd also like to run K-Nearest-Neighbor search.

If this is not natively supported, how can these operations be achieved efficiently?
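For the K-Nearest-Neighbor part, a minimal brute-force sketch in plain NumPy illustrates the kind of operation meant here. It assumes the embeddings have already been stacked into a 2-D array and that Euclidean distance is the metric; `knn_bruteforce` is a hypothetical helper name, not a dask or pandas API.

```python
import numpy as np

def knn_bruteforce(vectors, query, k):
    """Return (indices, distances) of the k rows of `vectors` closest to `query`.

    Hypothetical helper: exact search with Euclidean distance, O(n * d) per query.
    """
    dists = np.linalg.norm(vectors - query, axis=1)
    k = min(k, len(dists))
    # argpartition finds the k smallest distances without sorting everything
    idx = np.argpartition(dists, k - 1)[:k]
    # order those k candidates from nearest to farthest
    idx = idx[np.argsort(dists[idx])]
    return idx, dists[idx]

vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)
idx, dists = knn_bruteforce(vectors, np.array([4.0, 5.0, 7.0]), k=2)
print(idx, dists)
```

This is exact search over every row, so it is only a baseline; whether something comparable can run efficiently per partition in dask is part of the question.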

Example code showing the desired group-wise mean:

import dask.dataframe as dd
import pandas as pd
import numpy as np

# Sample DataFrame with arrays in one of the columns
data = {
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])],
    'scalar': [1, 2, 3, 4]
}
pdf = pd.DataFrame(data)

# Convert the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(pdf, npartitions=2)

# Desired behaviour: element-wise mean of the arrays within each category
result = ddf.groupby('category')['values'].mean().compute()

print(result)

Expected output:

category
A     [2.5, 3.5, 4.5]
B    [8.5, 9.5, 10.5]
Name: values, dtype: object
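
One possible workaround, sketched here in plain pandas, is to expand the array column into one numeric column per dimension and then use the ordinary groupby mean. It assumes every array has the same length; the column names `v0`, `v1`, ... are made up for illustration. Since the expanded columns are all numeric, the same layout should also be usable with a Dask DataFrame groupby, though that is not verified here.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "values": [np.array([1, 2, 3]), np.array([4, 5, 6]),
               np.array([7, 8, 9]), np.array([10, 11, 12])],
})

# Stack the equal-length arrays into a 2-D matrix: one row per record.
matrix = np.stack(pdf["values"].to_numpy())

# One scalar column per vector dimension (hypothetical names v0, v1, ...).
dim_cols = [f"v{i}" for i in range(matrix.shape[1])]
wide = pd.DataFrame(matrix, columns=dim_cols, index=pdf.index)
wide["category"] = pdf["category"]

# The regular groupby mean is now element-wise across the vector dimensions.
means = wide.groupby("category")[dim_cols].mean()
print(means)
```

Each row of `means` holds the element-wise mean vector for one category; `means.to_numpy()` gives it back as a 2-D array.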
@github-actions github-actions bot added the needs triage Needs a response from a contributor label May 14, 2024