Running a detection metric on time series data with no entity_columns fails #77

JacekCala · 2021-09-16T14:15:38Z

Environment Details


SDMetrics version: 0.3.2
Python version: 3.8.6
Operating System: CentOS

Error Description

My data is a sequence of a single type with no entity_columns set. As described in the user guide, model training works fine. However, I'm unable to run the detection metrics to check the goodness of fit. The error I get is ValueError: No group keys passed!

Looking at the source, the problem is that the code passes entity_columns straight to groupby, which fails when no entity columns are set:

SDMetrics/sdmetrics/timeseries/detection.py

Lines 37 to 49 in dd00b17

    
    @staticmethod
    def _build_x(data, transformer, entity_columns):
        X = pd.DataFrame()
        for entity_id, entity_data in data.groupby(entity_columns):
            entity_data = entity_data.drop(entity_columns, axis=1)
            entity_data = transformer.transform(entity_data)
            entity_data = pd.Series({
                column: entity_data[column].values
                for column in entity_data.columns
            }, name=entity_id)
            X = X.append(entity_data)

        return X

I'm wondering whether a simple change can fix the problem correctly (see below). Could you please confirm whether this is the right approach?

This is the change in _build_x:

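    def _build_x(data, transformer, entity_columns):
        X = pd.DataFrame()
        if entity_columns:
            for entity_id, entity_data in data.groupby(entity_columns):
                # Same loop body as in the original detection.py
                entity_data = entity_data.drop(entity_columns, axis=1)
                entity_data = transformer.transform(entity_data)
                entity_data = pd.Series({
                    column: entity_data[column].values
                    for column in entity_data.columns
                }, name=entity_id)
                X = X.append(entity_data)
        else:
            # No entity columns: treat the whole table as one sequence
            entity_data = transformer.transform(data)
            entity_data = pd.Series({
                column: entity_data[column].values
                for column in entity_data.columns
            })
            X = pd.DataFrame([entity_data])

        return X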

A second change is needed in the compute method. This line:

SDMetrics/sdmetrics/timeseries/detection.py

Line 90 in dd00b17

    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)

would be changed to:

    if entity_columns:
        X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)
    else:
        X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)
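To illustrate why the stratified split would need to be dropped in the single-sequence case, here is a minimal sketch. It assumes that X ends up with exactly two rows (one for the real sequence, one for the synthetic one) and that y holds one label per row; the toy arrays and the `split` helper are hypothetical and not part of SDMetrics.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed inputs: one row per sequence, one label per row (real vs. synthetic).
X = np.array([[0.0], [1.0]])
y = np.array([True, False])


def split(X, y, stratify):
    """Hypothetical helper mirroring the two branches of the proposed change."""
    return train_test_split(
        X, y, shuffle=True, stratify=y if stratify else None, random_state=0
    )


# Stratifying needs at least two members per class, so this raises ValueError.
try:
    split(X, y, stratify=True)
    stratified_ok = True
except ValueError:
    stratified_ok = False

# Without stratification the split succeeds: one row for training, one for testing.
X_train, X_test, y_train, y_test = split(X, y, stratify=False)
print(stratified_ok, len(X_train), len(X_test))
```

Note that with only two rows the training split holds a single class, so this shows that the split runs, not that the resulting classifier is informative.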

Steps to reproduce

A simple example based on the user guide:

from sdv.demo import load_timeseries_demo
from sdv.timeseries import PAR
from sdv.metrics.timeseries import LSTMDetection

data = load_timeseries_demo()
no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()
real_data = no_context[no_context.Symbol == 'TSLA'].copy()
del real_data['Symbol']

sequence_index = 'Date'
model = PAR(sequence_index = sequence_index)

print('Fitting model...')
model.fit(real_data)

print('Sampling data...')
synth_data = model.sample()

print('Running evaluation...')
val = LSTMDetection.compute(real_data, synth_data, metadata={
    'sequence_index': sequence_index,
    'fields': {
        'Date': {'type': 'datetime'},
        'Open': {'type': 'numerical', 'subtype': 'float'},
        'Close': {'type': 'numerical', 'subtype': 'float'},
        'Volume': {'type': 'numerical', 'subtype': 'integer'}
    }})

Once this is run, the error traceback is as follows:

     val = LSTMDetection.compute(real_data, synth_data, metadata={
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 85, in compute
    real_x = cls._build_x(real_data, transformer, entity_columns)
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 40, in _build_x
    for entity_id, entity_data in data.groupby(entity_columns):
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\frame.py", line 6515, in groupby
    return DataFrameGroupBy(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\groupby.py", line 525, in __init__
    grouper, exclusions, obj = get_grouper(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\grouper.py", line 821, in get_grouper
    raise ValueError("No group keys passed!")
ValueError: No group keys passed!
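The last frames of the traceback can be reproduced with pandas alone. The sketch below assumes that the metric ends up with an empty list for entity_columns when the metadata names none, which is consistent with the error message but is an inference; the toy frame is made up for illustration.

```python
import pandas as pd

# Toy single-sequence table (hypothetical values).
frame = pd.DataFrame({
    'Date': pd.date_range('2021-01-01', periods=3),
    'Open': [1.0, 2.0, 3.0],
})

# Assumption: no entity columns in the metadata yields an empty list here.
entity_columns = []

try:
    frame.groupby(entity_columns)
    message = None
except ValueError as error:
    message = str(error)

print(message)
```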


npatki · 2022-07-14T20:42:49Z

Hi @JacekCala, thanks for the detailed explanation and stack traces.

If your dataset has no entity_columns, then it represents just a single sequence. In that case, the synthetic sequence should correspond directly to the real sequence. It may therefore make more sense for you to apply the single table metrics, because they directly compare the correlation values and column shapes.

These timeseries metrics are designed specifically for multi-sequence data. With multi-sequence data, a synthetic sequence cannot be directly mapped or compared to a specific real sequence, so a more general metric (such as LSTMDetection) is needed. But if you only have a single sequence, I'm not sure that this metric will be that useful to you anyway.

Let me know what you think!

npatki · 2022-10-17T21:25:52Z

Circling back to this -- rather than make LSTMDetection work for single sequence data (which isn't really obvious), I've created a new feature request to track the creation of new metrics specifically for the single sequence case.

Let's defer to this new issue #246 to discuss new potential metrics. In the meantime, I'll close this off as working as intended (LSTMDetection was never intended to work when there is a single sequence).

JacekCala added the bug and pending review labels Sep 16, 2021

JacekCala changed the title ~~Trying to run the detection metric on time series data with no entity_columns fails~~ Trying to run a detection metric on time series data with no entity_columns fails Sep 16, 2021

JacekCala changed the title ~~Trying to run a detection metric on time series data with no entity_columns fails~~ Running a detection metric on time series data with no entity_columns fails Sep 16, 2021

npatki added the under discussion and data:sequential labels and removed the pending review label Jul 14, 2022

This was referenced Oct 17, 2022:

Time Series LSTMDetection and TSFCDetection Metrics sdv-dev/SDV#487 (Closed)

Create metrics for a single sequence of data #246 (Open)

npatki closed this as completed Oct 17, 2022

npatki added the resolution:WAI label and removed the under discussion label Oct 17, 2022
