Privacy Metrics error if target column has missing values #135

Open
@npatki

Description

Environment Details

  • SDV version: 0.13.0
  • Python version: 3.8.9
  • Operating System: MacOS

Error Description

The Numerical Privacy Metrics throw an error whenever the target columns (sensitive_fields) contain missing values.

Steps to Reproduce

Go through the User Guide to import & load data. Then, scroll down to the Privacy Metrics section.

The following code should work as-is according to the user guide.

NumericalLR.compute(
    real_data,
    synthetic_data,
    key_fields=['second_perc', 'mba_perc', 'degree_perc'],
    sensitive_fields=['salary'])

However, when I try to run this, I get an error from sklearn because the salary column contains NaN values:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Note: The same error is thrown when any of the key_fields contain missing values, e.g. if I swap salary and degree_perc in the above example.
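
To confirm which of the involved columns actually contain missing values before calling the metric, a quick check along these lines might help. This is only a sketch: it assumes the tables are pandas DataFrames, and the helper name `columns_with_nan` and the toy values are hypothetical, not part of SDV.

```python
import numpy as np
import pandas as pd


def columns_with_nan(table, columns):
    """Return the subset of `columns` that contain at least one NaN in `table`."""
    has_nan = table[columns].isna().any()
    return [col for col in columns if has_nan[col]]


# Toy stand-in for the demo table (hypothetical values, not the real dataset).
real_data = pd.DataFrame({
    'second_perc': [67.0, 79.3, 65.0],
    'mba_perc': [58.8, 66.3, 57.8],
    'degree_perc': [58.0, 77.5, 64.0],
    'salary': [270000.0, np.nan, 250000.0],
})

fields = ['second_perc', 'mba_perc', 'degree_perc', 'salary']
affected = columns_with_nan(real_data, fields)
print(affected)
```

Running the same check on synthetic_data would show whether the NaN values come from only one of the two tables.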

Suggested Fix

This used to work, so the breakage must come from a recent change in either SDV or sklearn. What were we doing before? Were we dropping the NaN values, filling them or imputing them?
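
For reference, the three options mentioned above could look roughly like this as a user-side preprocessing step. This is a sketch of the general techniques in pandas, not a reconstruction of what SDV did previously; the function names and toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missing(table, columns):
    """Drop every row that has a NaN in any of `columns`."""
    return table.dropna(subset=columns)


def fill_missing(table, columns, value=0):
    """Replace NaNs in `columns` with a fixed value, leaving the input untouched."""
    filled = table.copy()
    filled[columns] = filled[columns].fillna(value)
    return filled


def impute_mean(table, columns):
    """Replace NaNs in `columns` with each column's mean, leaving the input untouched."""
    imputed = table.copy()
    imputed[columns] = imputed[columns].fillna(imputed[columns].mean())
    return imputed


# Toy table with one missing salary (hypothetical values).
toy = pd.DataFrame({
    'degree_perc': [58.0, 77.5, 64.0],
    'salary': [270000.0, np.nan, 250000.0],
})

dropped = drop_missing(toy, ['salary'])
filled = fill_missing(toy, ['salary'], value=0)
imputed = impute_mean(toy, ['salary'])
```

Each option has a cost: dropping shrinks the table, a constant fill adds an artificial value that a privacy attacker model may latch onto, and mean imputation pulls the column's distribution toward its center. Which of these is appropriate for a privacy metric is the open question here.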

Also, maybe it's OK if it crashes on the first run. The user could then re-run with a flag for handling the missing values.
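
One possible shape for such a flag, sketched as a wrapper around a metric callable. Everything here is an assumption for illustration: the wrapper name `compute_with_missing`, the `missing` parameter and its values are hypothetical and not an existing SDV API, and `row_count_metric` is only a stand-in so the example runs without SDV.

```python
import numpy as np
import pandas as pd


def compute_with_missing(metric, real_data, synthetic_data, key_fields,
                         sensitive_fields, missing='error'):
    """Hypothetical wrapper: handle NaNs first, then delegate to `metric`.

    missing='error' raises a descriptive ValueError (the default);
    missing='drop' removes rows with NaNs in the involved columns.
    """
    columns = list(key_fields) + list(sensitive_fields)
    if missing == 'drop':
        real_data = real_data.dropna(subset=columns)
        synthetic_data = synthetic_data.dropna(subset=columns)
    elif missing == 'error':
        tables = (('real_data', real_data), ('synthetic_data', synthetic_data))
        for name, table in tables:
            bad = [col for col in columns if table[col].isna().any()]
            if bad:
                raise ValueError(
                    f'{name} has missing values in {bad}; '
                    're-run with missing="drop" to ignore those rows.')
    else:
        raise ValueError(f'Unknown missing option: {missing!r}')
    return metric(real_data, synthetic_data,
                  key_fields=key_fields, sensitive_fields=sensitive_fields)


def row_count_metric(real_data, synthetic_data, key_fields, sensitive_fields):
    """Stand-in metric that only reports how many rows it received."""
    return len(real_data), len(synthetic_data)


real = pd.DataFrame({'degree_perc': [58.0, 77.5, 64.0],
                     'salary': [270000.0, np.nan, 250000.0]})
synth = pd.DataFrame({'degree_perc': [60.1, 70.2, 66.3],
                      'salary': [260000.0, 255000.0, 240000.0]})

result = compute_with_missing(row_count_metric, real, synth,
                              key_fields=['degree_perc'],
                              sensitive_fields=['salary'],
                              missing='drop')
```

With the default `missing='error'`, the first run still fails, but the message names the affected columns and points at the flag instead of surfacing the raw sklearn error.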


Labels

bug (Something isn't working)
