@riley-harper riley-harper commented Aug 15, 2025

This closes #221.
It also closes #224, a CI/CD bug.

Previously, we manually set the seeds for Spark's built-in ML models but not for XGBoost and LightGBM. This inconsistency was an oversight on my part when adding XGBoost and LightGBM. Because no seed was set for XGBoost or LightGBM, the models trained by these libraries were slightly different on each run of hlink, which led to inconsistent matching results.

Also, because the seeds for the Spark models were set manually, users could not pass in their own seeds and were stuck with the single seed we had chosen.

Now all of these models are handled uniformly. We accept the seed set by the user if there is one. If there is no seed in the params dictionary, then we add a "seed": 2133 entry before passing the parameters to the classifier. This fixes both issues.
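For illustration, the defaulting behavior described above can be sketched roughly as follows. This is a minimal sketch, not the actual hlink code: the helper name `with_default_seed` and the constant `DEFAULT_SEED` are hypothetical, and it assumes the parameters arrive as a plain dictionary.

```python
# Hypothetical sketch of the seed-defaulting logic; names are illustrative,
# not taken from the hlink source.
DEFAULT_SEED = 2133


def with_default_seed(params: dict) -> dict:
    """Return a copy of params, adding "seed" only if the user did not set one."""
    # Keys in params override the default, so a user-supplied seed wins.
    return {"seed": DEFAULT_SEED, **params}


user_params = {"max_depth": 5, "seed": 42}
no_seed_params = {"max_depth": 5}

print(with_default_seed(user_params)["seed"])     # user's seed is kept
print(with_default_seed(no_seed_params)["seed"])  # default seed is added
```

Returning a new dictionary rather than mutating the input keeps the user's configuration untouched, which may or may not match how hlink handles it internally.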

We were recently updated automatically to Debian trixie with Java 21, but that seems to cause problems for the current version of XGBoost.
@riley-harper riley-harper merged commit 159f8da into main Aug 18, 2025
6 checks passed
@riley-harper riley-harper deleted the classifier_seeds branch August 18, 2025 15:01

Development

Successfully merging this pull request may close these issues.

CI/CD is failing with a Java-related error
Inconsistent matching results with XGBoost
