Running a detection metric on time series data with no entity_columns fails #77

Closed
JacekCala opened this issue Sep 16, 2021 · 2 comments
Labels
bug (Something isn't working), data:sequential (Related to timeseries datasets), resolution:WAI (The software is working as intended)

Comments

@JacekCala

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.3.2
  • Python version: 3.8.6
  • Operating System: CentOS

Error Description

My data is a sequence of a single type with no entity_columns set. As described in the user guide, model training works fine; however, I'm unable to run the detection metrics to check the "goodness" of fit. The error I get is ValueError: No group keys passed!

Looking into the source, the problem is indeed there: the code relies on the entity_columns variable:

    @staticmethod
    def _build_x(data, transformer, entity_columns):
        X = pd.DataFrame()
        for entity_id, entity_data in data.groupby(entity_columns):
            entity_data = entity_data.drop(entity_columns, axis=1)
            entity_data = transformer.transform(entity_data)
            entity_data = pd.Series({
                column: entity_data[column].values
                for column in entity_data.columns
            }, name=entity_id)
            X = X.append(entity_data)

        return X

I'm wondering whether a simple change can fix the problem correctly (see below). Could you please confirm whether this is the right way of thinking?

This is the change in _build_x:

    def _build_x(data, transformer, entity_columns):
        X = pd.DataFrame()
        if entity_columns:
            for entity_id, entity_data in data.groupby(entity_columns):
                # code as in the original detection.py L41-47...

        else:
            entity_data = transformer.transform(data)
            entity_data = pd.Series({
                column: entity_data[column].values
                for column in entity_data.columns
            })
            X = pd.DataFrame([entity_data])

        return X

and one more in the compute method, line

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)

to be changed to:

        if entity_columns:
            X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
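
For readers who want to try the idea outside the library, here is a minimal standalone sketch of the proposed fallback. It is not the SDMetrics implementation: the function name build_x and the plain callable transform (standing in for transformer.transform) are assumptions made for illustration. It also collects rows in a list, because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd


def build_x(data, transform, entity_columns):
    """Return one row per sequence; each cell holds a column's values as an array.

    Hypothetical standalone variant of the proposal, not the library's code.
    """
    if entity_columns:
        # Multi-sequence case: one group per entity, entity columns dropped.
        groups = (
            (entity_id, entity_data.drop(columns=entity_columns))
            for entity_id, entity_data in data.groupby(entity_columns)
        )
    else:
        # Single-sequence case: the whole table is treated as one sequence.
        groups = [(0, data)]

    rows = []
    for entity_id, entity_data in groups:
        entity_data = transform(entity_data)
        rows.append(pd.Series(
            {column: entity_data[column].values for column in entity_data.columns},
            name=entity_id,
        ))

    # Build the frame once instead of appending row by row.
    return pd.DataFrame(rows)


single = pd.DataFrame({'Open': [1.0, 2.0, 3.0], 'Close': [1.5, 2.5, 3.5]})
single_x = build_x(single, lambda frame: frame, [])

multi = pd.DataFrame({'id': ['a', 'a', 'b'], 'Open': [1.0, 2.0, 3.0]})
multi_x = build_x(multi, lambda frame: frame, ['id'])
```

With no entity columns the result has exactly one row, so the real and synthetic tables together yield only two samples for the classifier, which is why the stratified split also has to change in this proposal.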

Steps to reproduce

A simple example based on the user guide:

from sdv.demo import load_timeseries_demo
from sdv.timeseries import PAR
from sdv.metrics.timeseries import LSTMDetection

data = load_timeseries_demo()
no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()
real_data = no_context[no_context.Symbol == 'TSLA'].copy()
del real_data['Symbol']

sequence_index = 'Date'
model = PAR(sequence_index = sequence_index)

print('Fitting model...')
model.fit(real_data)

print('Sampling data...')
synth_data = model.sample()

print('Running evaluation...')
val = LSTMDetection.compute(real_data, synth_data, metadata={
    'sequence_index': sequence_index,
    'fields': {
        'Date': {'type': 'datetime'},
        'Open': {'type': 'numerical', 'subtype': 'float'},
        'Close': {'type': 'numerical', 'subtype': 'float'},
        'Volume': {'type': 'numerical', 'subtype': 'integer'}
    }})

Once this is run, the error traceback is as follows:

     val = LSTMDetection.compute(real_data, synth_data, metadata={
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 85, in compute
    real_x = cls._build_x(real_data, transformer, entity_columns)
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 40, in _build_x
    for entity_id, entity_data in data.groupby(entity_columns):
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\frame.py", line 6515, in groupby
    return DataFrameGroupBy(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\groupby.py", line 525, in __init__
    grouper, exclusions, obj = get_grouper(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\grouper.py", line 821, in get_grouper
    raise ValueError("No group keys passed!")
ValueError: No group keys passed!
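
The last frames of the traceback suggest the error comes from pandas itself rather than from SDMetrics: grouping a non-empty DataFrame by an empty list of keys. A minimal sketch that isolates this, assuming an empty list is what reaches groupby when no entity columns are set (the column names are only illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'Open': [1.0, 2.0], 'Close': [1.5, 2.5]})

try:
    # An empty list of group keys is rejected for a non-empty frame.
    frame.groupby([])
    message = None
except ValueError as error:
    message = str(error)

print(message)
```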
JacekCala added the bug and pending review labels Sep 16, 2021
JacekCala changed the title from "Trying to run the detection metric on time series data with no entity_columns fails" to "Trying to run a detection metric on time series data with no entity_columns fails" Sep 16, 2021
JacekCala changed the title from "Trying to run a detection metric on time series data with no entity_columns fails" to "Running a detection metric on time series data with no entity_columns fails" Sep 16, 2021
Contributor

npatki commented Jul 14, 2022

Hi @JacekCala, thanks for the detailed explanation and stack traces.

If your dataset has no entity_columns, then it represents just a single sequence. In that case, the synthetic sequence should have a direct correlation with the real sequence. At this point, it may make more sense for you to apply the single table metrics because they directly compare the correlation values and column shapes.

These timeseries metrics are designed specifically for multi-sequence data. With multi-sequence data, a synthetic sequence cannot be directly mapped or compared to a specific real sequence, so a more general metric (such as LSTMDetection) is needed. But if you only have a single sequence, I'm not sure that this metric will be that useful to you anyway.
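
To illustrate what a direct comparison of correlation values could look like for a single sequence, here is a rough sketch using plain pandas. It is not one of the SDMetrics single table metrics; the function name correlation_similarity and the scoring formula are assumptions chosen only to show the idea.

```python
import numpy as np
import pandas as pd


def correlation_similarity(real, synthetic):
    """Score in [0, 1]; 1 means identical pairwise Pearson correlations.

    Hypothetical helper for illustration, not an SDMetrics API.
    """
    real_corr = real.corr()
    synth_corr = synthetic.corr()
    # Correlations lie in [-1, 1], so absolute differences lie in [0, 2].
    diff = (real_corr - synth_corr).abs().to_numpy()
    # Average only the off-diagonal entries; the diagonal is always 1 vs 1.
    mask = ~np.eye(len(real_corr), dtype=bool)
    return 1.0 - diff[mask].mean() / 2.0


real = pd.DataFrame({'Open': [1.0, 2.0, 3.0, 4.0], 'Close': [2.0, 4.0, 5.0, 9.0]})
# A deliberately poor "synthetic" table: the Close column has its sign flipped.
flipped = real.assign(Close=-real['Close'])

same_score = correlation_similarity(real, real)
flipped_score = correlation_similarity(real, flipped)
```

A score like this only looks at pairwise column relationships; it says nothing about temporal ordering, so it complements rather than replaces a sequence-aware metric.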

Let me know what you think!

npatki added the under discussion and data:sequential labels and removed the pending review label Jul 14, 2022
Contributor

npatki commented Oct 17, 2022

Circling back to this -- rather than make LSTMDetection work for single sequence data (which isn't really obvious), I've created a new feature request to track the creation of new metrics specifically for the single sequence case.

Let's defer to this new issue #246 to discuss new potential metrics. In the meantime, I'll close this off as working as intended (LSTMDetection was never intended to work when there is a single sequence).

npatki closed this as completed Oct 17, 2022
npatki added the resolution:WAI label and removed the under discussion label Oct 17, 2022