lazy optimization to move filter on enum values before casting to enum #21615

ChristopherRussell · 2025-03-05T20:54:01Z

Description

I noticed some code of mine running slower than expected, with a much higher memory footprint. Eventually I realised the cause: I was filtering on an Enum column, and that filter couldn't be applied until the step where I cast an integer column to the Enum dtype, which came quite late in my query. This seems like an optimization the lazy engine could do itself: the underlying integers of the enum are known from the Enum dtype, so a filter on an enum value could be rewritten as a filter on the corresponding integer and moved before the cast. I admit it's probably niche in application, but I decided to share the idea, plus an example where it would help, to see if people agree it's a fair optimisation to consider.

import datetime
import polars as pl

n = 1_000_000
enum_dtype = pl.Enum(["a", "b", "c"])

# create a series of integer codes with 3 distinct values, where the code 2 occurs exactly once
codes = pl.Series([0, 1] * n)
codes[0] = 2

df = (
   pl.DataFrame({"enum": pl.Series(codes), "x": pl.Series(list(range(2 * n)))})
   .with_columns(
       t=pl.datetime_range(
           datetime.datetime(2025, 1, 1),
           datetime.datetime(2025, 1, 1) + ((2 * n - 1) * datetime.timedelta(seconds=1)),
           interval="1s",
       )
   )
   .lazy()
)

# do some operations that take some time but can be much faster if done post-filtering on `enum`
ldf_with_lazy_ops = (
   df.with_columns(
       x_sum_over_enum=pl.col.x.sum().over("enum"),
       x_n_unique_over_enum=pl.col.x.n_unique().over("enum"),
       t=pl.col.t.dt.truncate("1m"),
   )
   .group_by("enum", "t")
   .agg(pl.all().median())
)
print("Timing cast then filter")
%timeit ldf_with_lazy_ops.cast({"enum": enum_dtype}).filter(pl.col.enum == "c").collect()

print("Timing filter then cast")
%timeit ldf_with_lazy_ops.filter(pl.col.enum == 2).cast({"enum": enum_dtype}).collect()

Output:

Timing cast then filter
126 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timing filter then cast
2.93 ms ± 519 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
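
The manual rewrite timed above hard-codes the integer 2 for the category "c". As a minimal sketch of how that lookup could be derived from the dtype instead, the helper below maps a category to its position in the ordered category list. `enum_physical_code` is a hypothetical name, not part of Polars, and it assumes the physical integer of an Enum value equals its index in the category list, as it does in the example above.

```python
from collections.abc import Sequence


def enum_physical_code(categories: Sequence[str], value: str) -> int:
    """Return the position of `value` in an Enum's ordered category list.

    Hypothetical helper, not a Polars API. Assumes the integer stored for an
    Enum value is its index in the category list ("c" -> 2 in this issue).
    """
    ordered = list(categories)
    try:
        return ordered.index(value)
    except ValueError:
        # Fail loudly rather than silently filtering on a wrong code.
        raise ValueError(f"{value!r} is not one of the categories {ordered}") from None


# Same category order as `enum_dtype` in the example above.
categories = ["a", "b", "c"]
code = enum_physical_code(categories, "c")
print(code)  # prints 2

# The filter can then be written against the integer column, before the cast:
#   ldf_with_lazy_ops.filter(pl.col.enum == code).cast({"enum": enum_dtype})
```

In a real query the category list could be taken from the dtype itself, for example via `enum_dtype.categories.to_list()`, so the integer never has to be written by hand.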


ChristopherRussell added the "enhancement" label (New feature or an improvement of an existing feature) on Mar 5, 2025
