Refactor nested cross validation #169

Merged 45 commits into v4-dev on Dec 4, 2024

Conversation


@ccdavis commented Dec 3, 2024

  • Changes the algorithm used in link_step_train_test_models to the "nested cross validation" approach.
  • Added a pseudo-code description of the nested-cv algorithm in the comments (a simplified sketch also appears below).
  • The _run method on LinkStepTrainTestModels has been refactored to do only the setup for the run and the main algorithm, with the rest factored into other methods. We should probably refactor further so that the cross-validation behavior lives in its own pure class or module and doesn't rely so heavily on the class's self instance data; most functions only need config information.
  • The way cache() and unpersist() are used on Spark data frames is likely not optimal yet.

NOTE: Reporting of the final results isn't completely finished yet; we get one data frame for each outer fold, with the results of every threshold combination in each. These outer-fold results still need to be merged to give a true picture of how each threshold combination produces matches. The old algorithm computed the mean precision, mean recall, and mean MCC in a way that doesn't make sense for the new algorithm.

At this stage I'm primarily concerned with performance.
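
For context, here is a simplified, self-contained sketch of the nested cross-validation shape this PR adopts, written with plain scikit-learn on toy data. It is illustrative only: the estimator, grid, and fold counts are placeholders, and none of these names come from the hlink code.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    param_grid = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]

    outer = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in outer.split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        # Inner loop: pick hyperparameters using only this outer fold's training data.
        best = max(
            param_grid,
            key=lambda params: cross_val_score(
                LogisticRegression(**params), X_train, y_train, cv=3
            ).mean(),
        )
        # Refit on the full outer training split; score only on the held-out test split.
        model = LogisticRegression(**best).fit(X_train, y_train)
        print(best, matthews_corrcoef(y[test_idx], model.predict(X[test_idx])))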

ccdavis and others added 30 commits November 14, 2024 15:12
… and test all threshold matrix members against that set of params. Still has a failure.
…oesn't give good results making no matches in the test data, so precision is NaN.
…split used to test all thresholds isn't a good one.
@riley-harper (Contributor) left a comment:

This makes sense to me for the most part. It's a big change from how we were doing it before! I requested some changes, but most are on the smaller side. The broad algorithm looks great to me.

I just found the pyspark.ml.tuning module, and I suspect that we can make use of its CrossValidator, ParamGridBuilder, and metrics here. However, our algorithm is a little different from what the Spark logic does, so we may only be able to replace the inner cross-validation with CrossValidator; I'm not sure. CrossValidator has a parallelism: int argument which can increase the parallelism of model evaluation. This could be really helpful for speeding things up.
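
As a rough illustration of what the inner loop might look like with pyspark.ml.tuning (a sketch only; the estimator, label column, grid values, and the outer_training_data DataFrame are placeholders, not hlink's):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(labelCol="match", featuresCol="features")
    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build()
    )
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=grid,
        evaluator=BinaryClassificationEvaluator(labelCol="match"),
        numFolds=3,     # the inner folds
        parallelism=4,  # fit up to 4 models concurrently
    )
    cv_model = cv.fit(outer_training_data)  # one outer fold's training split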

Also, we are doing a lot of work to predict on the training data and compute metrics on it, but I don't think that we need to do that anymore; it was a feature of the previous algorithm. With nested cross-validation we should only need to compute metrics on the test data. This could also speed things up significantly (I would guess by around a factor of 2, maybe more).

I did not look at the tests, since it sounds like we are really going to need to rework those.

score: float
hyperparams: dict[str, Any]
threshold: float | list[float]
threshold_ratio: float | list[float] | bool
Contributor:

I think that it's a bug to store threshold_ratio as a bool. It's an optional float/list of floats, so I think that we should store it as float | list[float] | None. The code that extracts it out of the config file shouldn't make it default to False.
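
Something like the following extraction would give that default (the dict name here is an assumption, not the actual hlink variable):

    # Defaults to None when the key is absent, instead of False:
    threshold_ratio: float | list[float] | None = model_settings.get("threshold_ratio")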

Author:

Updated to set None.

Comment on lines 180 to 182
predict_train_tmp = _get_probability_and_select_pred_columns(
training_data, model, post_transformer, id_a, id_b, dep_var
)
Contributor:

Do we need to retain this logic? Since we're doing nested cross-validation, I think that we should avoid predicting on the training data. We can just predict on the test data. This may save us a significant amount of work.

Author:

Removed.


test_pred = predictions_tmp.toPandas()
precision, recall, thresholds_raw = precision_recall_curve(
test_pred[f"{dep_var}"],
Contributor:

I'm pretty sure that dep_var is a str everywhere, so we can just write

test_pred[dep_var]

to simplify things.

Author:

Yes.

Comment on lines 247 to 248
config,
training_conf,
Contributor:

These names are pretty confusing. Maybe we can rename them to make it clear that one is the config dictionary and one is the name of the training config section. Or maybe we could just pass the training part of the dictionary to this function.

Author:

I did both: I consolidated the passing of config + training_conf (the section name) to these functions by pulling out the training dict and calling it "training_settings" instead. This makes accessing the training settings a bit cleaner in places, and the names are more understandable. Really, we need to remove the reliance on the config structure all over the place, but that's a bigger change. A sketch of the shape of this change follows.
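
A rough before/after sketch of the change (the function name and config key here are illustrative, not the actual hlink identifiers):

    # Before: helpers took the whole config plus the name of the training section.
    def _score_thresholds(config: dict, training_conf: str) -> None:
        threshold = config[training_conf].get("decision_threshold")

    # After: the caller extracts the section once and passes just that dict.
    def _score_thresholds(training_settings: dict) -> None:
        threshold = training_settings.get("decision_threshold")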

Comment on lines +271 to +272
# thresholds and model_type are mixed in with the model hyper-parameters
# in the config; this removes them before passing to the model training.
Contributor:

Great comments, thanks for adding these!

Comment on lines +398 to +405
thresholding_predict_train = _get_probability_and_select_pred_columns(
cached_training_data,
thresholding_model,
thresholding_post_transformer,
id_a,
id_b,
dep_var,
)
Contributor:

I believe that we can drop this logic since it's computing metrics on the training data.

Author:

At the least, we can add a flag to skip computing metrics on the training data, and separate the capture of the test results from the training-data results. I'm leaving it alone for now until we refactor this part; a sketch of the flag idea follows.
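
A minimal sketch of that interim flag (the flag name is an assumption; _get_probability_and_select_pred_columns is the helper from the diff above):

    if compute_training_metrics:  # hypothetical flag, off by default
        thresholding_predict_train = _get_probability_and_select_pred_columns(
            cached_training_data,
            thresholding_model,
            thresholding_post_transformer,
            id_a,
            id_b,
            dep_var,
        )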

Comment on lines 426 to 432
predict_train = threshold_core.predict_using_thresholds(
thresholding_predict_train,
this_alpha_threshold,
this_threshold_ratio,
config[training_conf],
config["id_column"],
)
Contributor:

Eliminate this since it's on the training data.

Comment on lines 486 to 487
outer_fold_count = config[training_conf].get("n_training_iterations", 10)
inner_fold_count = 3
Contributor:

We may want to rename n_training_iterations to num_outer_folds and add a num_inner_folds attribute to the config.
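
Under that proposal, the lines above might read something like this (the attribute names are the suggestion here, not the current config schema):

    outer_fold_count = config[training_conf].get("num_outer_folds", 10)
    inner_fold_count = config[training_conf].get("num_inner_folds", 3)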

Author:

Yes, definitely. Let's make that another PR soon.

Comment on lines 489 to 490
if outer_fold_count < 3:
raise RuntimeError("You must use at least two training iterations.")
Contributor:

The error message and if statement don't seem to line up here. Do you need at least 2 or at least 3 iterations?

Author:

Yes, it's three, not two. Fixed.
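
Presumably the fixed guard reads something like this (a sketch based on the exchange above, not the exact committed code):

    if outer_fold_count < 3:
        raise RuntimeError("You must use at least three training iterations.")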

@@ -429,7 +830,7 @@ def _save_otd_data(
print("There were no true negatives recorded.")

def _create_otd_data(self, id_a: str, id_b: str) -> dict[str, Any] | None:
Contributor:

This is a good opportunity to rename the "OTD data" to something easier to understand. Maybe "suspicious data" would be clearer?

Author:

Yes, I replaced "otd" with "suspicious".

@ccdavis merged commit 11bdfd4 into v4-dev on Dec 4, 2024. 0 of 3 checks passed.
@riley-harper deleted the refactor-nested-cross-validation branch on December 16, 2024.