Remove "suspicious data" functionality from model exploration #178

riley-harper · 2024-12-10T21:35:59Z

This is for #176.

I've removed all of the code for tracking and saving the suspicious data, and I've removed output_suspicious_TD from the docs. I took this opportunity to rewrite _get_confusion_matrix() to use a single select() instead of four separate filter() + count() calls. I did some initial profiling, and I think this should be a pretty significant speedup.
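To illustrate the idea behind the _get_confusion_matrix() rewrite, here is a minimal plain-Python sketch. It is an analogy, not the PySpark code from this PR: the (actual, predicted) row layout, the ConfusionMatrix type, and both function names are assumptions made for the example. It contrasts counting each cell with its own filter against tallying all four cells in one pass over the data.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class ConfusionMatrix(NamedTuple):
    """Hypothetical container for the four confusion-matrix counts."""
    true_positives: int
    false_positives: int
    false_negatives: int
    true_negatives: int


def confusion_matrix_four_passes(rows: Iterable[Tuple[bool, bool]]) -> ConfusionMatrix:
    """Old shape: one filter + count per cell, so the data is scanned four times."""
    rows = list(rows)
    tp = sum(1 for actual, predicted in rows if actual and predicted)
    fp = sum(1 for actual, predicted in rows if not actual and predicted)
    fn = sum(1 for actual, predicted in rows if actual and not predicted)
    tn = sum(1 for actual, predicted in rows if not actual and not predicted)
    return ConfusionMatrix(tp, fp, fn, tn)


def confusion_matrix_single_pass(rows: Iterable[Tuple[bool, bool]]) -> ConfusionMatrix:
    """New shape: tally every (actual, predicted) combination in a single scan."""
    counts = Counter((bool(actual), bool(predicted)) for actual, predicted in rows)
    return ConfusionMatrix(
        true_positives=counts[(True, True)],
        false_positives=counts[(False, True)],
        false_negatives=counts[(True, False)],
        true_negatives=counts[(False, False)],
    )


# Each row is (actual, predicted); the values are made up for illustration.
example_rows = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (True, True),
]

single = confusion_matrix_single_pass(example_rows)
four = confusion_matrix_four_passes(example_rows)
```

Both functions return the same counts; the difference is how many times the rows are scanned. In Spark, the analogous change would presumably be one select() of conditional aggregates in place of four filter() + count() actions, so the DataFrame is evaluated in one job instead of four.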

I also renamed some variables with capitalized names like TP and FN, either lowercasing them or spelling them out as "true_positives", "false_negatives", etc.

riley-harper · 2024-12-10T22:16:06Z

The tests were failing because of an update to scikit-learn 1.6.0, which came out yesterday. It's only a problem with xgboost, so I've added an additional requirement to the xgboost extra, and that seems to have fixed things.

ccdavis

Looks good, just what I expected. The rewritten _get_confusion_matrix() is much better.

riley-harper added 5 commits December 10, 2024 14:00

[#176] Remove output_suspicious_TD and "suspicious training data" support

b7f821c

[#176] Add a unit test for _get_confusion_matrix()

9755f73

[#176] Rewrite _get_confusion_matrix() to avoid using 4 filters + counts

c43b57d

Using a single select() should let us take better advantage of Spark's parallel/distributed computing. My initial results profiling this are pretty promising.

[#176] Add a unit test for _get_aggregate_metrics()

4aad62e

[#176] Lowercase tp/fp/fn/tn variable names

3efbb0c

riley-harper requested a review from ccdavis December 10, 2024 21:36

Try requiring scikit-learn<1.6 when xgboost is installed

627eed8

ccdavis approved these changes Dec 11, 2024

riley-harper merged commit c1f0d8c into v4-dev Dec 11, 2024
6 checks passed

riley-harper deleted the no-suspicious-data branch December 11, 2024 17:02
