I noticed some code of mine running slower than expected and with a much higher memory footprint. Eventually I realised the issue was that a filter on an enum column could not be applied until a step late in my query where I cast an integer column to the Enum dtype. This seems like an optimisation the query planner could do, since the underlying integers of the enum are known from the Enum dtype itself, though I admit it is probably niche in application. I decided to share the idea, plus an example where it would help, to see if people agree it is a fair optimisation to consider.
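The key point is that the Enum dtype itself carries the category-to-integer mapping, so a predicate on the casted column can in principle be translated into a predicate on the raw integer codes before the cast runs. A minimal sketch of that lookup, assuming `pl.Enum.categories` behaves as in recent Polars releases:

import polars as pl

enum_dtype = pl.Enum(["a", "b", "c"])

# the dtype alone is enough to resolve the physical code for "c";
# no data needs to be scanned
code_for_c = enum_dtype.categories.to_list().index("c")  # -> 2

# so `pl.col("enum") == "c"` after the cast is equivalent to
# `pl.col("enum") == code_for_c` before it

The example below times exactly this rewrite done by hand.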
import datetime

import polars as pl

n = 1_000_000
enum_dtype = pl.Enum(["a", "b", "c"])

# create a code series with 3 distinct values, where the value 2 occurs exactly once
codes = pl.Series([0, 1] * n)
codes[0] = 2

df = (
    pl.DataFrame({"enum": pl.Series(codes), "x": pl.Series(list(range(2 * n)))})
    .with_columns(
        t=pl.datetime_range(
            datetime.datetime(2025, 1, 1),
            datetime.datetime(2025, 1, 1) + ((2 * n - 1) * datetime.timedelta(seconds=1)),
            interval="1s",
        )
    )
    .lazy()
)
# do some operations that take some time but can be much faster if done post-filtering on `enum`
ldf_with_lazy_ops = (
    df.with_columns(
        x_sum_over_enum=pl.col.x.sum().over("enum"),
        x_n_unique_over_enum=pl.col.x.n_unique().over("enum"),
        t=pl.col.t.dt.truncate("1m"),
    )
    .group_by("enum", "t")
    .agg(pl.all().median())
)

print("Timing cast then filter")
%timeit ldf_with_lazy_ops.cast({"enum": enum_dtype}).filter(pl.col.enum == "c").collect()
print("Timing filter then cast")
%timeit ldf_with_lazy_ops.filter(pl.col.enum == 2).cast({"enum": enum_dtype}).collect()
Output:
Timing cast then filter
126 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timing filter then cast
2.93 ms ± 519 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
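For completeness, here is a rough sketch of the manual rewrite used in the fast branch above, wrapped in a helper. The helper name is purely illustrative (it is not an existing Polars API) and it assumes a simple equality filter against a single known category:

def filter_enum_before_cast(
    ldf: pl.LazyFrame, column: str, dtype: pl.Enum, value: str
) -> pl.LazyFrame:
    # translate the category string into its physical integer code
    code = dtype.categories.to_list().index(value)
    # filter on the raw integer column first, then cast only the surviving rows
    return ldf.filter(pl.col(column) == code).cast({column: dtype})

# equivalent to ldf_with_lazy_ops.cast({"enum": enum_dtype}).filter(pl.col.enum == "c"),
# but with the predicate applied before the cast
result = filter_enum_before_cast(ldf_with_lazy_ops, "enum", enum_dtype, "c").collect()

An optimiser rule could apply the same translation automatically whenever the predicate only references categories of the target Enum dtype.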