Add the XGBoost ML library #161
For the tests, maybe a good place to start is to write tests which skip themselves when xgboost isn't present. We can use pytest.mark.skipif for this. Then we can make sure to install xgboost in the CI/CD runs, but the tests will still pass in environments where xgboost isn't available. This doesn't really help us test the condition where xgboost is not available, though; that test would just make sure that we raise an exception that makes sense. I don't really want to add tests which are skipped when xgboost is available, since that seems confusing and encourages fiddling with the environment back and forth as you run tests. Maybe we just won't write tests for this error case to start with.
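For reference, here's a minimal sketch of that skip pattern (the module and test names are hypothetical, not taken from hlink's test suite):

```python
import pytest

# Detect xgboost once at import time so the marker below can use it.
try:
    import xgboost  # noqa: F401

    xgboost_available = True
except ImportError:
    xgboost_available = False


@pytest.mark.skipif(not xgboost_available, reason="requires the xgboost package")
def test_training_with_xgboost_model():
    # Only runs in environments where xgboost is importable.
    ...
```

Calling pytest.importorskip("xgboost") at the top of a module is another way to get the same behavior for every test in that module.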
This test is currently failing if you have xgboost installed. If you don't have xgboost installed, it skips itself to prevent failures due to missing packages and dependencies.
I suppose even if we skip the tests when xgboost isn't available, we probably want to run the tests both with and without xgboost to make sure that hlink runs as expected when xgboost isn't installed. Looking around on the internet, I've seen "tox" and "pytest-xdist" mentioned as tools that can help with this.
Another option is to make use of GitHub Actions matrix builds.
Edit: this might be a bit easier and more flexible.
…ype xgboost This is only possible when we have the xgboost module, so raise an error if that is not present.
…odel This test is failing right now because we also need pyarrow>=4 when using xgboost. We should add this as a dependency in the xgboost extra. If xgboost isn't installed, this test skips itself.
…tras This should let us have two different test setups for each Python version: one with xgboost, one without.
I've also updated pytest to be more verbose for clarity.
Like some of the other models, xgboost returns an array of probabilities of the form [probability_no, probability_yes], so we extract just probability_yes as the probability for hlink's purposes.
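A rough sketch of that extraction on a Spark DataFrame (the column and variable names are illustrative, not hlink's actual ones):

```python
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# The scored DataFrame has a "probability" vector column shaped like
# [probability_no, probability_yes]; keep only the positive-class entry.
scored = fitted_model.transform(features_df)
scored = scored.withColumn("probability", vector_to_array(col("probability"))[1])
```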
xgboost has a different setup for feature importances, so the current logic ignores them. We'll need to update the save model metadata step to include logic specifically for xgboost.
This is really different from the Spark models, so I've made it a special case instead of trying to integrate it closely with the previous logic. This section might be due for some refactoring now.
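A hedged sketch of what that special case might look like (the variable names and the zero fallback for features xgboost omits are assumptions, not hlink's actual implementation):

```python
# Spark ML classifiers expose a featureImportances vector, while an xgboost
# model exposes a Booster whose get_score() returns a dict mapping feature
# names ("f0", "f1", ...) to importance values, omitting unused features.
if model_type == "xgboost":
    booster = fitted_model.get_booster()
    scores = booster.get_score(importance_type="gain")
    feature_importances = [scores.get(f"f{i}", 0.0) for i in range(num_features)]
else:
    feature_importances = fitted_model.featureImportances.toArray()
```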
This also updates Alabaster to 1.0.0.
This is a new IPUMS-motivated feature. We would like to integrate the XGBoost library into hlink so that you can use it like any of the other ML algorithms already available. Since XGBoost-Spark integration is currently experimental and since XGBoost has some other dependencies (libomp and pyarrow), we would like to make this an opt-in feature so that you don't need to have the XGBoost dependencies unless you want to use xgboost.
We intend to do this by adding a new optional-dependencies section to pyproject.toml. Then in hlink/linking/core/classifier.py, we can try to import xgboost, but not error out if it's not available. And if a user sets their model_type = "xgboost", we can confirm that the package is present and error out if it's not (a rough sketch of this follows below).

My biggest question with this setup is how we test the feature. Do we just always install xgboost when running tests? Should we have two test environments, one with xgboost and one without?
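A rough sketch of the guarded import and the check, assuming the classifier module dispatches on model_type (the function name, exception type, and install hint here are assumptions, not settled decisions):

```python
# Hypothetical sketch for hlink/linking/core/classifier.py: import xgboost
# lazily and only complain when a config actually asks for it.
try:
    import xgboost.spark  # provided by the optional "xgboost" extra

    _xgboost_available = True
except ImportError:
    _xgboost_available = False


def choose_classifier(model_type, params):
    if model_type == "xgboost":
        if not _xgboost_available:
            raise ModuleNotFoundError(
                "model_type 'xgboost' requires the optional xgboost dependencies; "
                "install them with something like `pip install hlink[xgboost]`"
            )
        return xgboost.spark.SparkXGBClassifier(**params)
    # ... existing handling for the Spark ML model types ...
```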
To Do List