
Add the XGBoost ML library #161

Closed
9 of 10 tasks
riley-harper opened this issue Nov 14, 2024 · 3 comments · Fixed by #165
Labels
type: feature A new feature or enhancement to a feature

Comments

@riley-harper
Contributor

riley-harper commented Nov 14, 2024

This is a new IPUMS-motivated feature. We would like to integrate the XGBoost library into hlink so that it can be used like any of the other ML algorithms already available. Since XGBoost-Spark integration is currently experimental, and since XGBoost has some other dependencies (libomp and pyarrow), we would like to make this an opt-in feature so that you don't need to install the XGBoost dependencies unless you want to use xgboost.

We intend to do this by adding a new optional-dependencies section to pyproject.toml.

[project.optional-dependencies]
dev = ...
xgboost = ["xgboost", "pyarrow>=4"]
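
Users who want XGBoost support would then presumably opt in through the new extra, something like

pip install hlink[xgboost]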

Then in hlink/linking/core/classifier.py, we can try to import xgboost, but not error out if it's not available.

try:
    import xgboost
except ModuleNotFoundError:
    xgboost_available = False
else:
    xgboost_available = True

And if a user sets their model_type = "xgboost", we can confirm that the package is present and error out if it's not.

...
elif model_type == "xgboost":
    if not xgboost_available:
        raise ModuleNotFoundError("could not find xgboost, please install it")
    ...
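
For illustration, here is a minimal sketch of how the rest of that branch might look, assuming xgboost's experimental Spark estimator xgboost.spark.SparkXGBClassifier (the params variable and the error wording here are illustrative, not settled):

elif model_type == "xgboost":
    if not xgboost_available:
        raise ModuleNotFoundError(
            "model type 'xgboost' requires the xgboost package; "
            "try installing hlink with the xgboost extra"
        )
    from xgboost.spark import SparkXGBClassifier

    # Pass the user's hyperparameters through to the estimator.
    classifier = SparkXGBClassifier(**params)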

My biggest question with this setup is how we test the feature. Do we just always install xgboost when running tests? Should we have two test environments, one with xgboost and one without?

To Do List

  • Create an hlink extra which installs xgboost and its dependencies
  • Integrate xgboost into core/classifier.py
  • Make sure xgboost works with model_exploration
  • Make sure xgboost works with training (especially step 3, save model metadata)
  • Make sure xgboost works with matching
  • Document the new hlink pip extra and additional requirements for xgboost (libomp). Note that xgboost-PySpark integration is still experimental, so it may not be as stable as other hlink features.
  • Document the new model type in the Sphinx docs
  • Make a nice, informative error message for when xgboost isn't available
  • Update the tests to run once without xgboost installed and once with it installed
  • Reduce the verbosity of xgboost logs so it doesn't output to stdout/stderr (can we redirect these to the Python logging system somehow? A sketch follows this list.)
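
On that last item, a minimal sketch of the verbosity piece, using xgboost's public set_config API (whether xgboost's remaining C++-level output can actually be captured by Python logging is exactly the open question above):

import logging

import xgboost

# Global xgboost verbosity: 0 = silent, 1 = warning, 2 = info, 3 = debug.
xgboost.set_config(verbosity=0)

# Quiet any messages xgboost emits through standard Python logging; the
# logger name here is an assumption, not confirmed from xgboost's source.
logging.getLogger("xgboost").setLevel(logging.WARNING)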
@riley-harper
Contributor Author

For the tests, maybe a good place to start is to write tests which skip themselves when xgboost isn't present. We can use pytest.mark.skipif for this. Then we can make sure to install xgboost in the CI/CD tests, but the tests should still pass even if xgboost isn't available.
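
A minimal sketch of that approach (the test name and the spark fixture are illustrative):

import pytest

try:
    import xgboost  # noqa: F401

    xgboost_available = True
except ModuleNotFoundError:
    xgboost_available = False


@pytest.mark.skipif(not xgboost_available, reason="requires the xgboost extra")
def test_train_xgboost_model(spark):
    ...

Calling pytest.importorskip("xgboost") at module level would work as well.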

This doesn't really help us test the condition where xgboost is not available, though. A test for that condition would just make sure that we raise an exception that makes sense. I don't really want to add tests which are skipped when xgboost is available, since that seems confusing and encourages fiddling with the environment back and forth as you run tests. Maybe we just won't write tests for this error case to start with.

riley-harper added a commit that referenced this issue Nov 14, 2024
This test is currently failing if you have xgboost installed. If you don't have
xgboost installed, it skips itself to prevent failures due to missing packages
and dependencies.
riley-harper added a commit that referenced this issue Nov 14, 2024
@riley-harper
Contributor Author

I suppose even if we skip the tests when xgboost isn't available, we probably want to run the tests with and without xgboost to make sure that hlink runs as expected when xgboost isn't installed. Looking around on the internet, I've seen "tox" and "pytest-xdist" mentioned as tools that can help with this.

@riley-harper
Contributor Author

riley-harper commented Nov 14, 2024

Another option is to make use of GitHub Actions matrix builds.

matrix:
  python_version: ["3.10", "3.11", "3.12"]
  include_extras: ["no", "yes"]

Edit: This might be a bit easier and more flexible:

matrix:
  python_version: ["3.10", "3.11", "3.12"]
  hlink_extras: ["dev", "dev,xgboost"]
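
A sketch of how the extras entry might be consumed in the install step (the step details here are illustrative, not from this issue):

steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: ${{ matrix.python_version }}
  - run: pip install .[${{ matrix.hlink_extras }}]
  - run: pytest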

riley-harper added a commit that referenced this issue Nov 14, 2024
…ype xgboost

This is only possible when we have the xgboost module, so raise an error if
that is not present.
riley-harper added a commit that referenced this issue Nov 15, 2024
…odel

This test is failing right now because we also need pyarrow>=4 when using
xgboost. We should add this as a dependency in the xgboost extra. If xgboost
isn't installed, this test skips itself.
riley-harper added a commit that referenced this issue Nov 15, 2024
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
riley-harper added a commit that referenced this issue Nov 15, 2024
I've also updated pytest to be more verbose for clarity.
riley-harper added a commit that referenced this issue Nov 15, 2024
Like some of the other models, xgboost returns an array of probabilities like
[probability_no, probability_yes]. So we extract just probability_yes as our
probability for hlink purposes.
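
A minimal sketch of that extraction, assuming the predictions DataFrame has a vector-valued probability column (the column names are illustrative; vector_to_array is pyspark's standard helper in pyspark.ml.functions):

from pyspark.ml.functions import vector_to_array

# Keep only the second element, probability_yes, as hlink's probability.
predictions = predictions.withColumn(
    "probability", vector_to_array("probability_array")[1]
)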
riley-harper added a commit that referenced this issue Nov 18, 2024
xgboost has a different setup for feature importances, so the current logic
ignores it. We'll need to update the save model metadata step to include logic
specifically for xgboost.
riley-harper added a commit that referenced this issue Nov 18, 2024
This is really different from the Spark models, so I've made it a special case
instead of trying to integrate it with the previous logic closely. This section
might be due for some refactoring now.
riley-harper added a commit that referenced this issue Nov 19, 2024
This also updates Alabaster to 1.0.0.
riley-harper added a commit that referenced this issue Nov 19, 2024
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
riley-harper added a commit that referenced this issue Nov 21, 2024
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
riley-harper added a commit that referenced this issue Nov 21, 2024
We now compute two feature importances for each model.

- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
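
A sketch of how both numbers can be read off a trained model, using xgboost's standard Booster.get_score API (the model variable is illustrative):

booster = model.get_booster()
# "weight" counts each feature's splits; "total_gain" sums the gain over
# all of a feature's splits ("gain" alone would be the per-split average).
weight = booster.get_score(importance_type="weight")
gain = booster.get_score(importance_type="total_gain")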
riley-harper added a commit that referenced this issue Nov 21, 2024
I'm still not entirely happy with this, but it's a tricky point in the code
because most of the models behave one way, but xgboost and lightgbm are
different. Some more refactoring might be in order.
riley-harper added a commit that referenced this issue Nov 25, 2024
- It turns out that multi-line inline TOML tables aren't allowed. So let's use
  the [training.chosen_model] syntax instead (see the sketch below).
- I clarified the introductory information and made it general enough to apply
  to XGBoost and LightGBM as well.
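
For illustration, the difference between the two syntaxes looks roughly like this (the max_depth parameter is just an example):

# Not valid TOML: an inline table must fit on one line.
# chosen_model = { type = "xgboost",
#                  max_depth = 5 }

# Valid: a standard table may span multiple lines.
[training.chosen_model]
type = "xgboost"
max_depth = 5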
riley-harper added the type: feature label and removed the enhancement label on Dec 4, 2024