Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Groupby ags output columns (even if they are counts) as Tags.CATEGORICAL #1841

Open
radekosmulski opened this issue Jun 9, 2023 · 6 comments
Labels
bug Something isn't working P2

Comments

@radekosmulski
Copy link
Contributor

Describe the bug
The output of the Categorify operator is tagged as Tags.CATEGORICAL even if those might just be counts.
image

Steps/Code to reproduce bug
Here is a full reproducer:

df = cudf.DataFrame(data={'id': [0, 0, 1, 2], 'cat': list(range(4)), 'val': [10, 20, 20, 30]})

ds = nvt.Dataset(df)

cat = ['cat'] >> nvt.ops.Categorify()
groupby_features = cat + ['id', 'val'] >> nvt.ops.Groupby(
    groupby_cols=['id'],
    aggs={
        'cat': ['list', 'count'],
        'val': ['list']
    }
)

wf = nvt.Workflow(groupby_features)
ds_transformed = wf.fit_transform(ds)

Expected behavior
count columns are not tagged as Tags.CATEGORICAL

Environment details (please complete the following information):
current main

@radekosmulski radekosmulski added bug Something isn't working P2 labels Jun 9, 2023
@karlhigley
Copy link
Contributor

If I understand the example, the behavior of Categorify doesn't seem to be the issue. If the workflow stopped after the Categorify, categorical would be the correct label (since the output of Categorify is always categorical by definition.)

But since the workflow continues after that, the subsequent operators need to adjust the schemas of the columns they operate on in order to reflect the transforms they implement. It looks like Groupby isn't correctly applying Tags.CONTINUOUS when the aggregation is a count.

@karlhigley
Copy link
Contributor

@jperez999, could you point someone who isn't you (@radekosmulski, @oliverholworthy, @nv-alaiacano, or @edknv maybe, if they're willing) at how to fix this in the Groupby op?

@rnyak
Copy link
Contributor

rnyak commented Jun 20, 2023

@karlhigley @radekosmulski tagging count column as Continuous might be problematic as well.. Since an integer value (count is an int), can be seen as a categorical value for a user. It does not have to be continuous. Therefore, I'd say we should not tag count agg neither as continuous nor as categorical from Groupby op.

If a count column normalized afterwards, then it will take continuous tag, if it is gonna be bucketed or hashed via categorify op then it can take categorical tag.

@karlhigley
Copy link
Contributor

I think you're right about how things should ultimately be tagged, but I want to re-iterate that it's each op's responsibility to apply the appropriate tags based only on the transformation that op applies (not future or previous ops, who can take care of that for themselves.)

If the result of a count aggregation is continuous, then Groupby should tag it as such. If subsequent operations like Normalize or hashing/bucketing operations change whether it's continuous, categorical, or an embedding, then those ops should apply whatever tag becomes appropriate after their transformations.

We don't want to be guessing what ops a user might apply subsequently in order to apply tags; we should just apply the currently correct tags and rely on subsequent ops to change them as needed.

@rnyak
Copy link
Contributor

rnyak commented Jun 21, 2023

@karlhigley I agree. What I wanted to say should count be considered continuous or categorical? I say it could be both :)

@karlhigley
Copy link
Contributor

I don't know either 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P2
Projects
None yet
Development

No branches or pull requests

3 participants