Add correlation similarity metric #158
Conversation
Codecov Report
@@ Coverage Diff @@
## master #158 +/- ##
==========================================
+ Coverage 58.06% 58.68% +0.62%
==========================================
Files 59 60 +1
Lines 1774 1813 +39
==========================================
+ Hits 1030 1064 +34
- Misses 744 749 +5
Continue to review full report at Codecov.
Some minor comments about datetime handling/testing. Besides that this looks good!
synthetic_data[pd.isna(synthetic_data)] = 0.0
column1, column2 = real_data.columns[:2]

if is_datetime(real_data):
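For context, a minimal runnable sketch of the datetime handling the snippet above hints at. The `is_datetime` helper here is a stand-in written for this example, not necessarily the one used in the PR:

```python
import pandas as pd
from scipy.stats import pearsonr


def is_datetime(data):
    """Stand-in helper: True if every column of the frame is datetime-typed."""
    return all(
        pd.api.types.is_datetime64_any_dtype(data[col]) for col in data.columns
    )


real_data = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01']),
    'b': pd.to_datetime(['2020-02-01', '2020-07-01', '2021-02-01']),
})

if is_datetime(real_data):
    # pearsonr cannot consume datetimes directly; cast to int64 nanoseconds.
    real_data = real_data.apply(lambda column: column.astype('int64'))

score, _ = pearsonr(real_data['a'], real_data['b'])
```

The cast to `int64` keeps the relative spacing between timestamps, so the correlation computed on the nanosecond values matches the intuitive correlation between the two date columns.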
should we error if synthetic data isn't also datetime?
Hm, so I have a couple of thoughts:
- This handling is used in many metrics across the column_pairs and single_column metrics. I do think it would be nice to add more comprehensive validation, but I'm not sure it's worth adding in this PR, since that would make the handling less unified across metrics.
- Most usage should go through the single table metrics, which filter the metrics by data type and only apply the relevant ones. In that case we always apply the correct metric for a given data type, which is why I don't think this is a huge issue right now. Of course, people can still invoke the column pair metric directly.

I'm in favor of opening another issue around adding data type verification on the base classes of ColumnPairMetric and SingleColumnMetric, and addressing it for all metrics. Let me know what you think.
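A rough sketch of what such base-class validation might look like. The class and method names here are illustrative, not the actual sdmetrics API:

```python
import pandas as pd


class ColumnPairsMetric:
    """Illustrative base class; the real sdmetrics base differs."""

    @staticmethod
    def _validate_dtypes(real_data, synthetic_data):
        # Hypothetical check: both inputs must have matching column dtypes,
        # so a datetime metric never silently runs on non-datetime input.
        for column in real_data.columns:
            real_dtype = real_data[column].dtype
            synth_dtype = synthetic_data[column].dtype
            if real_dtype != synth_dtype:
                raise ValueError(
                    f"Column '{column}' has dtype {real_dtype} in the real "
                    f"data but {synth_dtype} in the synthetic data."
                )


real = pd.DataFrame({'a': pd.to_datetime(['2020-01-01'])})
synth = pd.DataFrame({'a': ['2020-01-01']})  # plain strings, not datetime
try:
    ColumnPairsMetric._validate_dtypes(real, synth)
except ValueError as error:
    print(error)
```

Putting the check on the base class would cover every subclass metric at once, which matches the "addressing it for all metrics" suggestion above.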
Makes sense to me
Cool, added an issue for tracking here: #168
import pandas as pd


class DataFrameMatcher:
are these classes being added for future use?
Yeah, I just added it here in case we need it in the future, since it seemed likely that we would want to match a data frame at some point.
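For reference, a matcher like this is typically implemented by overriding `__eq__` so a DataFrame can participate in mock call assertions (plain `==` on a DataFrame returns an elementwise frame, which breaks `assert_called_with`). A plausible sketch; the actual implementation in this PR may differ:

```python
import pandas as pd
from unittest.mock import Mock


class DataFrameMatcher:
    """Equality matcher so a DataFrame can be used in mock call assertions."""

    def __init__(self, expected):
        self.expected = expected

    def __eq__(self, other):
        # equals() gives a single bool, unlike DataFrame.__eq__.
        return isinstance(other, pd.DataFrame) and self.expected.equals(other)


mock = Mock()
mock(pd.DataFrame({'a': [1, 2]}))

# The matcher sits on the expected side, so its __eq__ drives the comparison.
mock.assert_called_with(DataFrameMatcher(pd.DataFrame({'a': [1, 2]})))
```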
class TestCorrelationSimilarity:

    @patch('sdmetrics.column_pairs.statistical.correlation_similarity.pearsonr')
    def test_compute_breakdown(self, pearson_mock):
can we do an example with datetime columns?
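To illustrate the mocking pattern such a test uses, here is a self-contained example with datetime columns. The metric itself is a simplified stand-in written for this sketch; only the `@patch`-style mechanics mirror the real test, where the target string is the `correlation_similarity` module path shown above:

```python
import sys
from unittest.mock import patch

import pandas as pd
from scipy.stats import pearsonr  # replaced by the patch below


def correlation_score(data):
    """Toy stand-in for the metric: datetime columns become int64 first."""
    numeric = data.apply(
        lambda col: col.astype('int64')
        if pd.api.types.is_datetime64_any_dtype(col) else col
    )
    score, _ = pearsonr(numeric.iloc[:, 0], numeric.iloc[:, 1])
    return score


real = pd.DataFrame({
    'a': pd.to_datetime(['2020-01-01', '2020-06-01', '2021-01-01']),
    'b': pd.to_datetime(['2020-02-01', '2020-07-01', '2021-02-01']),
})

# Patch pearsonr in the namespace where it is *used*, mirroring the
# '...correlation_similarity.pearsonr' target string in the real test.
with patch.object(
    sys.modules[__name__], 'pearsonr', return_value=(0.9, 0.0)
) as pearson_mock:
    assert correlation_score(real) == 0.9
    pearson_mock.assert_called_once()
```

Mocking `pearsonr` lets the test exercise the datetime-to-numeric coercion without depending on scipy's arithmetic, which is exactly what makes a datetime test case cheap to add.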
Compared 0141130 to 5846cc6
Resolves #143