
Add the XGBoost ML library #161

Closed
9 of 10 tasks
riley-harper opened this issue Nov 14, 2024 · 3 comments · Fixed by #165
Labels
type: feature A new feature or enhancement to a feature

Comments

@riley-harper
Contributor

riley-harper commented Nov 14, 2024

This is a new IPUMS-motivated feature. We would like to integrate the XGBoost library into hlink so that it can be used like any of the other ML algorithms already available. Since XGBoost-Spark integration is currently experimental, and since XGBoost has some other dependencies (libomp and pyarrow), we would like to make this an opt-in feature so that you don't need to install the XGBoost dependencies unless you want to use xgboost.

We intend to do this by adding a new optional-dependencies section to pyproject.toml.

[project.optional-dependencies]
dev = ...
xgboost = ["xgboost", "pyarrow>=4"]
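
Users who want XGBoost support would then presumably opt in through the new extra, something like

pip install hlink[xgboost]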

Then in hlink/linking/core/classifier.py, we can try to import xgboost, but not error out if it's not available.

try:
    import xgboost
except ModuleNotFoundError:
    xgboost_available = False
else:
    xgboost_available = True

And if a user sets their model_type = "xgboost", we can confirm that the package is present and error out if it's not.

...
elif model_type == "xgboost":
    if not xgboost_available:
        raise ModuleNotFoundError("could not find xgboost, please install it")
    ...
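
For illustration, here is a minimal sketch of how the rest of that branch might look, assuming xgboost's experimental Spark estimator xgboost.spark.SparkXGBClassifier (the params variable and the error wording here are illustrative, not settled):

elif model_type == "xgboost":
    if not xgboost_available:
        raise ModuleNotFoundError(
            "model type 'xgboost' requires the xgboost package; "
            "try installing hlink with the xgboost extra"
        )
    from xgboost.spark import SparkXGBClassifier

    # Pass the user's hyperparameters through to the estimator.
    classifier = SparkXGBClassifier(**params)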

My biggest question with this setup is how we test the feature. Do we just always install xgboost when running tests? Should we have two test environments, one with xgboost and one without?

To Do List

  • Create an hlink extra which installs xgboost and its dependencies
  • Integrate xgboost into core/classifier.py
  • Make sure xgboost works with model_exploration
  • Make sure xgboost works with training (especially step 3, save model metadata)
  • Make sure xgboost works with matching
  • Document the new hlink pip extra and additional requirements for xgboost (libomp). Note that xgboost-PySpark integration is still experimental, so it may not be as stable as other hlink features.
  • Document the new model type in the Sphinx docs
  • Make a nice, informative error message for when xgboost isn't available
  • Update the tests to run once without xgboost installed and once with it installed
  • Reduce the verbosity of xgboost logs so it doesn't output to stdout/stderr (can we redirect these to the Python logging system somehow? A sketch follows this list.)
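
On that last item, a minimal sketch of the verbosity piece, using xgboost's public set_config API (whether xgboost's remaining C++-level output can actually be captured by Python logging is exactly the open question above):

import logging

import xgboost

# Global xgboost verbosity: 0 = silent, 1 = warning, 2 = info, 3 = debug.
xgboost.set_config(verbosity=0)

# Quiet any messages xgboost emits through standard Python logging; the
# logger name here is an assumption, not confirmed from xgboost's source.
logging.getLogger("xgboost").setLevel(logging.WARNING)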
@riley-harper
Contributor Author

For the tests, maybe a good place to start is to write tests which skip themselves when xgboost isn't present. We can use pytest.mark.skipif for this. Then we can make sure to install xgboost in the CI/CD tests, but the tests should still pass even if xgboost isn't available.
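
A minimal sketch of that approach (the test name and the spark fixture are illustrative):

import pytest

try:
    import xgboost  # noqa: F401

    xgboost_available = True
except ModuleNotFoundError:
    xgboost_available = False


@pytest.mark.skipif(not xgboost_available, reason="requires the xgboost extra")
def test_train_xgboost_model(spark):
    ...

Calling pytest.importorskip("xgboost") at module level would work as well.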

This doesn't really help us test the condition where xgboost is not available, though. A test for that condition would just make sure that we raise an exception that makes sense. I don't really want to add tests which are skipped when xgboost is available, since that seems confusing and encourages fiddling with the environment back and forth as you run tests. Maybe we just won't write tests for this error case to start with.

riley-harper added a commit that referenced this issue Nov 14, 2024
This test is currently failing if you have xgboost installed. If you don't have
xgboost installed, it skips itself to prevent failures due to missing packages
and dependencies.
riley-harper added a commit that referenced this issue Nov 14, 2024
@riley-harper
Contributor Author

I suppose even if we skip the tests when xgboost isn't available, we probably want to run the tests with and without xgboost to make sure that hlink runs as expected when xgboost isn't installed. Looking around on the internet, I've seen "tox" and "pytest-xdist" mentioned as tools that can help with this.

@riley-harper
Contributor Author

riley-harper commented Nov 14, 2024

Another option is to make use of GitHub Actions matrix builds.

matrix:
  python_version: ["3.10", "3.11", "3.12"]
  include_extras: ["no", "yes"]

Edit: This might be a bit easier and more flexible:

matrix:
  python_version: ["3.10", "3.11", "3.12"]
  hlink_extras: ["dev", "dev,xgboost"]
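
A sketch of how the extras entry might be consumed in the install step (the step details here are illustrative, not from this issue):

steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: ${{ matrix.python_version }}
  - run: pip install .[${{ matrix.hlink_extras }}]
  - run: pytest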

riley-harper added a commit that referenced this issue Nov 14, 2024
…ype xgboost

This is only possible when we have the xgboost module, so raise an error if
that is not present.
riley-harper added a commit that referenced this issue Nov 15, 2024
…odel

This test is failing right now because we also need pyarrow>=4 when using
xgboost. We should add this as a dependency in the xgboost extra. If xgboost
isn't installed, this test skips itself.
riley-harper added a commit that referenced this issue Nov 15, 2024
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
riley-harper added a commit that referenced this issue Nov 15, 2024
I've also updated pytest to be more verbose for clarity.
riley-harper added a commit that referenced this issue Nov 15, 2024
Like some of the other models, xgboost returns an array of probabilities like
[probability_no, probability_yes]. So we extract just probability_yes as our
probability for hlink purposes.
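
A minimal sketch of that extraction, assuming the predictions DataFrame has a vector-valued probability column (the column names are illustrative; vector_to_array is pyspark's standard helper in pyspark.ml.functions):

from pyspark.ml.functions import vector_to_array

# Keep only the second element, probability_yes, as hlink's probability.
predictions = predictions.withColumn(
    "probability", vector_to_array("probability_array")[1]
)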
riley-harper added a commit that referenced this issue Nov 18, 2024
xgboost has a different setup for feature importances, so the current logic
ignores it. We'll need to update the save model metadata step to include logic
specifically for xgboost.
riley-harper added a commit that referenced this issue Nov 18, 2024
This is really different from the Spark models, so I've made it a special case
instead of trying to integrate it with the previous logic closely. This section
might be due for some refactoring now.
riley-harper added a commit that referenced this issue Nov 19, 2024
This also updates Alabaster to 1.0.0.
riley-harper added a commit that referenced this issue Nov 19, 2024
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
riley-harper added a commit that referenced this issue Nov 21, 2024
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
riley-harper added a commit that referenced this issue Nov 21, 2024
We now compute two feature importances for each model.

- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
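
A sketch of how both numbers can be read off a trained model, using xgboost's standard Booster.get_score API (the model variable is illustrative):

booster = model.get_booster()
# "weight" counts each feature's splits; "total_gain" sums the gain over
# all of a feature's splits ("gain" alone would be the per-split average).
weight = booster.get_score(importance_type="weight")
gain = booster.get_score(importance_type="total_gain")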
riley-harper added a commit that referenced this issue Nov 21, 2024
I'm still not entirely happy with this, but it's a tricky point in the code
because most of the models behave one way, but xgboost and lightgbm are
different. Some more refactoring might be in order.
riley-harper added a commit that referenced this issue Nov 25, 2024
- It turns out that multi-line inline TOML tables aren't allowed. So let's use
  the [training.chosen_model] syntax instead (see the sketch below).
- I clarified the introductory information and made it general enough to apply
  to XGBoost and LightGBM as well.
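
For illustration, the difference between the two syntaxes looks roughly like this (the max_depth parameter is just an example):

# Not valid TOML: an inline table must fit on one line.
# chosen_model = { type = "xgboost",
#                  max_depth = 5 }

# Valid: a standard table may span multiple lines.
[training.chosen_model]
type = "xgboost"
max_depth = 5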
riley-harper added the type: feature label and removed the enhancement label on Dec 4, 2024