lazy optimization to move filter on enum values before casting to enum #21615

Open
ChristopherRussell opened this issue Mar 5, 2025 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Description

I noticed some of my code running slower than expected and with a much higher memory footprint. Eventually I realised the cause: I was filtering on an Enum column, and that filter could not be applied until the step where I cast an integer column to the Enum dtype, which came quite late in my query. This seems like an optimization the lazy engine could perform, since the underlying integer of each enum value is known from the Enum dtype itself: a filter on an enum value after the cast could be rewritten as a filter on the corresponding integer before the cast, and then pushed down. I admit it is probably niche in application, but I decided to share the idea along with an example where it would help, to see whether people agree it is a fair optimization to consider.

import datetime
import polars as pl

n = 1_000_000
enum_dtype = pl.Enum(["a", "b", "c"])

# create a series with 3 distinct values, where the value 2 occurs exactly once
codes = pl.Series([0, 1] * n)
codes[0] = 2

# Build the frame and make it lazy; `enum` holds plain integer codes until it is cast later.
ldf = (
    pl.DataFrame({"enum": codes, "x": pl.Series(list(range(2 * n)))})
    .with_columns(
        t=pl.datetime_range(
            datetime.datetime(2025, 1, 1),
            datetime.datetime(2025, 1, 1) + ((2 * n - 1) * datetime.timedelta(seconds=1)),
            interval="1s",
        )
    )
    .lazy()
)

# Operations that are slow on the full frame but much faster once the rows are filtered on `enum`.
ldf_with_lazy_ops = (
    ldf.with_columns(
        x_sum_over_enum=pl.col.x.sum().over("enum"),
        x_n_unique_over_enum=pl.col.x.n_unique().over("enum"),
        t=pl.col.t.dt.truncate("1m"),
    )
    .group_by("enum", "t")
    .agg(pl.all().median())
)
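# Note: %timeit is an IPython/Jupyter magic, so the timings below need a notebook or IPython shell.
print("Timing cast then filter")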
%timeit ldf_with_lazy_ops.cast({"enum": enum_dtype}).filter(pl.col.enum == "c").collect()

print("Timing filter then cast")
%timeit ldf_with_lazy_ops.filter(pl.col.enum == 2).cast({"enum": enum_dtype}).collect()

Output:

Timing cast then filter
126 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timing filter then cast
2.93 ms ± 519 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)