Add correlation similarity metric #158
Conversation
Codecov Report
@@ Coverage Diff @@
## master #158 +/- ##
==========================================
+ Coverage 58.06% 58.68% +0.62%
==========================================
Files 59 60 +1
Lines 1774 1813 +39
==========================================
+ Hits 1030 1064 +34
- Misses 744 749 +5
Continue to review full report at Codecov.
Some minor comments about datetime handling/testing. Besides that this looks good!
synthetic_data[pd.isna(synthetic_data)] = 0.0
column1, column2 = real_data.columns[:2]

if is_datetime(real_data):
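For context, a minimal runnable sketch of the datetime handling the snippet above hints at. The `is_datetime` helper here is a stand-in written for this example, not necessarily the one used in the PR:

```python
import pandas as pd
from scipy.stats import pearsonr


def is_datetime(data):
    """Stand-in helper: True if every column of the frame is datetime-typed."""
    return all(
        pd.api.types.is_datetime64_any_dtype(data[col]) for col in data.columns
    )


real_data = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01']),
    'b': pd.to_datetime(['2020-02-01', '2020-07-01', '2021-02-01']),
})

if is_datetime(real_data):
    # pearsonr cannot consume datetimes directly; cast to int64 nanoseconds.
    real_data = real_data.apply(lambda column: column.astype('int64'))

score, _ = pearsonr(real_data['a'], real_data['b'])
```

The cast to `int64` keeps the relative spacing between timestamps, so the correlation computed on the nanosecond values matches the intuitive correlation between the two date columns.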
should we error if synthetic data isn't also datetime?
Hm, so I have a couple of thoughts:
- This handling is used in many metrics across the column_pairs and single_column metrics. I do think it would be nice to add more comprehensive validation, but I'm not sure it's worth adding in this PR, since that would make the handling less unified across metrics.
- Most usage should go through the single table metrics, which filter the metrics by data type and only apply the relevant ones. In that case we always apply the correct metric for a given data type, which is why I don't think this is a huge issue right now. Of course, people can still invoke the column pair metric directly.

I'm in favor of opening another issue around adding data type verification on the base classes of ColumnPairMetric and SingleColumnMetric, and addressing it for all metrics. Let me know what you think.
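A rough sketch of what such base-class validation might look like. The class and method names here are illustrative, not the actual sdmetrics API:

```python
import pandas as pd


class ColumnPairsMetric:
    """Illustrative base class; the real sdmetrics base differs."""

    @staticmethod
    def _validate_dtypes(real_data, synthetic_data):
        # Hypothetical check: both inputs must have matching column dtypes,
        # so a datetime metric never silently runs on non-datetime input.
        for column in real_data.columns:
            real_dtype = real_data[column].dtype
            synth_dtype = synthetic_data[column].dtype
            if real_dtype != synth_dtype:
                raise ValueError(
                    f"Column '{column}' has dtype {real_dtype} in the real "
                    f"data but {synth_dtype} in the synthetic data."
                )


real = pd.DataFrame({'a': pd.to_datetime(['2020-01-01'])})
synth = pd.DataFrame({'a': ['2020-01-01']})  # plain strings, not datetime
try:
    ColumnPairsMetric._validate_dtypes(real, synth)
except ValueError as error:
    print(error)
```

Putting the check on the base class would cover every subclass metric at once, which matches the "addressing it for all metrics" suggestion above.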
Makes sense to me
Cool, added an issue for tracking here: #168
import pandas as pd


class DataFrameMatcher:
are these classes being added for future use?
Yeah, I just added it here in case we need it in the future, since it seemed likely that we would want to match a data frame at some point.
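For reference, a matcher like this is typically implemented by overriding `__eq__` so a DataFrame can participate in mock call assertions (plain `==` on a DataFrame returns an elementwise frame, which breaks `assert_called_with`). A plausible sketch; the actual implementation in this PR may differ:

```python
import pandas as pd
from unittest.mock import Mock


class DataFrameMatcher:
    """Equality matcher so a DataFrame can be used in mock call assertions."""

    def __init__(self, expected):
        self.expected = expected

    def __eq__(self, other):
        # equals() gives a single bool, unlike DataFrame.__eq__.
        return isinstance(other, pd.DataFrame) and self.expected.equals(other)


mock = Mock()
mock(pd.DataFrame({'a': [1, 2]}))

# The matcher sits on the expected side, so its __eq__ drives the comparison.
mock.assert_called_with(DataFrameMatcher(pd.DataFrame({'a': [1, 2]})))
```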
class TestCorrelationSimilarity:

    @patch('sdmetrics.column_pairs.statistical.correlation_similarity.pearsonr')
    def test_compute_breakdown(self, pearson_mock):
can we do an example with datetime columns?
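To illustrate the mocking pattern such a test uses, here is a self-contained example with datetime columns. The metric itself is a simplified stand-in written for this sketch; only the `@patch`-style mechanics mirror the real test, where the target string is the `correlation_similarity` module path shown above:

```python
import sys
from unittest.mock import patch

import pandas as pd
from scipy.stats import pearsonr  # replaced by the patch below


def correlation_score(data):
    """Toy stand-in for the metric: datetime columns become int64 first."""
    numeric = data.apply(
        lambda col: col.astype('int64')
        if pd.api.types.is_datetime64_any_dtype(col) else col
    )
    score, _ = pearsonr(numeric.iloc[:, 0], numeric.iloc[:, 1])
    return score


real = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01']),
    'b': pd.to_datetime(['2020-02-01', '2020-07-01', '2021-02-01']),
})

# Patch pearsonr in the namespace where it is *used*, mirroring the
# '...correlation_similarity.pearsonr' target string in the real test.
with patch.object(
    sys.modules[__name__], 'pearsonr', return_value=(0.9, 0.0)
) as pearson_mock:
    assert correlation_score(real) == 0.9
    pearson_mock.assert_called_once()
```

Mocking `pearsonr` lets the test exercise the datetime-to-numeric coercion without depending on scipy's arithmetic, which is exactly what makes a datetime test case cheap to add.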
Compared 0141130 to 5846cc6
Resolves #143