Add ignore_nulls (and ignore_nan) option to xdt.ewma_by_time #71

Open
wbeardall opened this issue Mar 22, 2024 · 7 comments

@wbeardall

Currently, if there are any NaN values in the value column passed to xdt.ewma_by_time, then all following values in the output are NaN (see the snippet below). It would be great if there were an ignore_nulls flag, similar to the built-in ewm_mean, to allow NaN or null values to be ignored during the calculation, to prevent this. In that case, the presence or absence of a row containing null or NaN should have no effect on subsequent rows; i.e. the EWMA output of the final row of the two following tables should be identical.

shape: (2, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘
shape: (3, 2)
┌───────────┬────────────────────────────┐
│ values    ┆ time                       │
│ ---       ┆ ---                        │
│ f64       ┆ datetime[ns]               │
╞═══════════╪════════════════════════════╡
│ -0.042898 ┆ 2000-01-01 00:00:00        │
│ NaN       ┆ 2000-01-01 00:00:00.000001 │
│ 0.186466  ┆ 2000-01-01 00:00:00.000002 │
└───────────┴────────────────────────────┘

Reproducible snippet

from datetime import timedelta

import numpy as np
import polars as pl
import polars_xdt as xdt


n = 100

df = pl.DataFrame({
    "values": np.linspace(0, 10, n) + 0.1 * np.random.normal(size=n),
    "time": np.datetime64("2000-01-01 00:00:00")
    + np.asarray([i * np.timedelta64(1000, "ns") for i in range(n)]),
})

# Without NaN values, every EWMA output is finite.
new = df.with_columns(
    xdt.ewma_by_time(
        "values", times="time", half_life=timedelta(microseconds=1)
    ).alias("ewma")
)

# True
print(new["ewma"].is_finite().all())

# Replace values above 5 with NaN: every output from the first NaN onwards is NaN.
new_with_nan = df.with_columns(
    xdt.ewma_by_time(
        pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values")),
        times="time",
        half_life=timedelta(microseconds=1),
    ).alias("ewma")
)

# False
print(new_with_nan["ewma"].is_finite().all())
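
For concreteness, a sketch of what the requested call might look like (hypothetical: an ignore_nulls parameter does not currently exist in xdt.ewma_by_time; this is the behaviour being proposed):

# Hypothetical usage, continuing from the snippet above. With an
# ignore_nulls-style flag, NaN/null rows would be skipped rather than
# poisoning every subsequent output value.
proposed = df.with_columns(
    xdt.ewma_by_time(
        pl.when(pl.col("values") > 5).then(np.nan).otherwise(pl.col("values")),
        times="time",
        half_life=timedelta(microseconds=1),
        ignore_nulls=True,  # hypothetical flag requested in this issue
    ).alias("ewma")
)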

@MarcoGorelli
Collaborator

thanks @wbeardall for the request! seems reasonable, will take a look

@wbeardall
Author

I've submitted a PR showing how I'd go about implementing this. Let me know your thoughts!

@MarcoGorelli
Collaborator

> similar to the built-in ewm_mean, to allow NaN or null values to be ignored during the calculation

Are you sure this is what the ewm_mean one does?

In [15]: s = pl.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])

In [16]: s.ewm_mean(alpha=.1, ignore_nulls=True)
Out[16]:
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        NaN
        NaN
]

Looks like NaN values still propagate there?

Which looks correct to me - Polars (unlike pandas) generally distinguishes NaN and null
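
For reference, a minimal sketch of the distinction, using the standard Series methods is_nan, is_null and fill_nan:

import polars as pl

s = pl.Series([1.0, float("nan"), None])
print(s.is_nan())                   # NaN is a float value; the null element stays null
print(s.is_null())                  # only the missing (null) element is flagged
print(s.fill_nan(None).is_null())   # fill_nan(None) turns NaN into null as well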

@wbeardall
Author

I think you might be right here; in all honesty, I'm a very recent convert from pandas and may have misread the design of pl.ewm_mean. The main motivation for this issue was to add a way to prevent NaN propagation in time-series data, similar to how pandas handles NaN elements with its ignore_na flag (below); that isn't necessarily the same thing as the ignore_nulls feature of pl.ewm_mean. It seems this is the same question as the one you raised last night about the need to distinguish between NaN and null values.

>>> import pandas as pd
>>> s = pd.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])
>>> s.ewm(alpha=.1, ignore_na=True, min_periods=1).mean()
0    1.100000
1    1.836842
2    2.118450
3    2.113085
4    2.113085
5    2.842473
dtype: float64

My particular use cases, and the PR I submitted last night, focus on the ignore_na case, which I would appreciate as a feature; is it worth adding ignore_nulls as well, either under this issue or a separate one?

@wbeardall changed the title from "Add ignore_nulls option to xdt.ewma_by_time" to "Add ignore_nulls (and ignore_nan) option to xdt.ewma_by_time" on Mar 23, 2024
@wbeardall
Author

Perhaps it is better to propagate NaNs, have null values behave as initially written in the PR, and document that converting NaN to null is the mechanism by which users can let the EWMA carry on past such values? e.g.

>>> import polars as pl
>>> s = pl.Series([1.1, 2.5, 2.6, 2.1, float('nan'), 5.1])
>>> s.fill_nan(None).ewm_mean(alpha=.1, ignore_nulls=True)
shape: (6,)
Series: '' [f64]
[
        1.1
        1.836842
        2.11845
        2.113085
        2.113085
        2.842473
]

@MarcoGorelli
Collaborator

yeah doing s.fill_nan(None) first feels like idiomatic Polars
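
Applied to the reproducible snippet at the top of this issue, that would look roughly like the sketch below (whether the nulls are then skipped depends on the behaviour the PR implements):

# Sketch, continuing from the reproducible snippet above: convert the NaN
# sentinel values to null before the EWMA, so that the proposed null
# handling in xdt.ewma_by_time can skip over them.
new_with_null = df.with_columns(
    xdt.ewma_by_time(
        pl.when(pl.col("values") > 5)
        .then(np.nan)
        .otherwise(pl.col("values"))
        .fill_nan(None),
        times="time",
        half_life=timedelta(microseconds=1),
    ).alias("ewma")
)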

@wbeardall
Author

I've pushed an implementation of the above, and also improved robustness: in the previous version, if a series started with a null value, the kernel would panic because it attempted to call .unwrap() on that null. Let me know your thoughts!
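
A minimal reproduction of that edge case looks like the sketch below (the exact output depends on the implementation; the point is that it should no longer panic):

from datetime import datetime, timedelta

import polars as pl
import polars_xdt as xdt

# The value column starts with a null -- previously the kernel panicked
# on an .unwrap() of the missing first element.
df_leading_null = pl.DataFrame({
    "values": [None, 1.0, 2.0],
    "time": [
        datetime(2000, 1, 1, 0, 0, 0),
        datetime(2000, 1, 1, 0, 0, 1),
        datetime(2000, 1, 1, 0, 0, 2),
    ],
})

print(df_leading_null.with_columns(
    xdt.ewma_by_time(
        "values", times="time", half_life=timedelta(seconds=1)
    ).alias("ewma")
))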
