
Add support for XGBoost and LightGBM #165

Merged
merged 48 commits on Dec 4, 2024
Changes from all commits (48 commits)
fbb4b44
[#161] Add xgboost as an optional dependency
riley-harper Nov 14, 2024
a51f20f
[#161] Add a test for xgboost classifier support
riley-harper Nov 14, 2024
010f3f5
[#161] Run black
riley-harper Nov 14, 2024
a865825
[#161] Ignore flake8 unused import error
riley-harper Nov 14, 2024
287912e
[#161] Create a SparkXGBClassifier in choose_classifier() for model_t…
riley-harper Nov 14, 2024
a7b0c37
[#161] Add a test that runs the whole training task with an xgboost m…
riley-harper Nov 15, 2024
5c6fdc9
[#161] Update the Dockerfile to support build with different hlink ex…
riley-harper Nov 15, 2024
a259811
[#161] Update docker-build.yml to run tests with and without xgboost
riley-harper Nov 15, 2024
a95992c
[#161] Add pyarrow as a dependency for the xgboost extra
riley-harper Nov 15, 2024
c64cf43
[#161] Factor conditional xgboost test logic into a single marker
riley-harper Nov 15, 2024
88d7199
[#161] Add an integration test for xgboost, set the post-transformer
riley-harper Nov 15, 2024
97aa7e2
[#161] Update test to check xgboost training_feature_importances
riley-harper Nov 18, 2024
7423169
[#161] Pull column and category logic before feature importances logic
riley-harper Nov 18, 2024
ffba81a
[#161] Support saving model metadata for xgboost
riley-harper Nov 18, 2024
0277d7d
[#161] Rename a variable in training step 3
riley-harper Nov 18, 2024
3ca3952
[#161] Make the "xgboost is missing" error message more helpful
riley-harper Nov 18, 2024
b992ba5
[#161] Update the README with information on XGBoost
riley-harper Nov 18, 2024
3065310
[#161] Add information about xgboost to models.md
riley-harper Nov 18, 2024
ab1d83a
[#161] Regenerate Sphinx docs
riley-harper Nov 19, 2024
59033b2
[#162] Create a lightgbm hlink extra
riley-harper Nov 19, 2024
88956ec
[#162] Create a test for choose_classifier() support for lightgbm
riley-harper Nov 19, 2024
dcafbc0
[#162] Allow model_type lightgbm in choose_classifier()
riley-harper Nov 19, 2024
83f6b5c
[#162] Fix a flake8 error
riley-harper Nov 19, 2024
48a93ef
[#161] Update the Dockerfile to support build with different hlink ex…
riley-harper Nov 15, 2024
1aef721
[#162] Run CI/CD once with lightgbm and once without
riley-harper Nov 19, 2024
72fd83c
[#162] Add two training tests for lightgbm
riley-harper Nov 19, 2024
062ad63
[#162] Add a rough draft of a RenameVectorAttributes transformer
riley-harper Nov 20, 2024
7c34bab
[#162] Implement basic RenameVectorAttributes logic
riley-harper Nov 20, 2024
a4f3534
[#162] Implement RenameVectorAttributes and make it more flexible via…
riley-harper Nov 20, 2024
2e58078
[#162] Fix a bug in RenameVectorAttributes
riley-harper Nov 20, 2024
8150ee5
[#162] Integrate RenameVectorAttributes to remove colons from Interac…
riley-harper Nov 20, 2024
b2dfa4e
[#162] Add an integration test for matching with LightGBM, and set th…
riley-harper Nov 20, 2024
444c6a7
[#162] Add hlink notice to the top of new files, add logging to Renam…
riley-harper Nov 20, 2024
aae00f6
[#161, #162] Merge branch 'add_xgboost' into add_new_ml_algs
riley-harper Nov 21, 2024
34b1a26
Merge branch 'main' into add_new_ml_algs
riley-harper Nov 21, 2024
7f7afe7
[#162] Integrate LightGBM with training step 3
riley-harper Nov 21, 2024
5ef0879
[#161, #162] Unify feature importances for XGBoost and LightGBM
riley-harper Nov 21, 2024
010f46a
[#161, #162] Refactor training step 3 to reduce duplication
riley-harper Nov 21, 2024
7864432
[#161, #162] Rename some variables and add logging in training step 3
riley-harper Nov 21, 2024
7dcc81d
[#162] Swap to using ADD JAR instead of spark.jars.packages
riley-harper Nov 22, 2024
6ae1f4e
[#162] Add lightgbm docs to sphinx-docs/models.md
riley-harper Nov 22, 2024
987f71c
[#161, #162] Match up the "missing module" error messages for xgboost…
riley-harper Nov 22, 2024
aeaef93
[#161, #162] Update the README with docs on xgboost and lightgbm
riley-harper Nov 25, 2024
35d527c
Remove .metals/ and ignore it
riley-harper Nov 25, 2024
96b0e0f
[#161, #162] Update the models.md Sphinx docs page
riley-harper Nov 25, 2024
ac3bb71
[#162] Get LightGBM to work with bucketized features
riley-harper Dec 3, 2024
c5bf26e
[#162] Require lightgbm for a new test, remove debugging output
riley-harper Dec 3, 2024
52d7721
Merge branch 'main' into add_new_ml_algs
riley-harper Dec 4, 2024
5 changes: 3 additions & 2 deletions .github/workflows/docker-build.yml
@@ -17,12 +17,13 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11", "3.12"]
hlink_extras: ["dev", "dev,lightgbm,xgboost"]
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Build the Docker image
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }}
run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }} --build-arg HLINK_EXTRAS=${{ matrix.hlink_extras }}

- name: Check dependency versions
run: |
@@ -34,7 +35,7 @@
run: docker run $HLINK_TAG-${{ matrix.python_version}} black --check .

- name: Test
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest -ra

- name: Build sdist and wheel
run: docker run $HLINK_TAG-${{ matrix.python_version}} python -m build
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ scala_jar/target
scala_jar/project/target
*.class
*.cache
.metals/

# MacOS
.DS_Store
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,5 +1,6 @@
ARG PYTHON_VERSION=3.10
FROM python:${PYTHON_VERSION}
ARG HLINK_EXTRAS=dev

RUN apt-get update && apt-get install default-jre-headless -y

@@ -8,4 +9,4 @@ WORKDIR /hlink

COPY . .
RUN python -m pip install --upgrade pip
RUN pip install -e .[dev]
RUN pip install -e .[${HLINK_EXTRAS}]
49 changes: 43 additions & 6 deletions README.md
@@ -26,19 +26,56 @@ We do our best to make hlink compatible with Python 3.10-3.12. If you have a
problem using hlink on one of these versions of Python, please open an issue
through GitHub. Versions of Python older than 3.10 are not supported.

Note that pyspark 3.5 does not yet officially support Python 3.12. If you
encounter pyspark-related import errors while running hlink on Python 3.12, try
Note that PySpark 3.5 does not yet officially support Python 3.12. If you
encounter PySpark-related import errors while running hlink on Python 3.12, try

- Installing the setuptools package. The distutils package was deleted from the
standard library in Python 3.12, but some versions of pyspark still import
standard library in Python 3.12, but some versions of PySpark still import
it. The setuptools package provides a hacky stand-in distutils library which
should fix some import errors in pyspark. We install setuptools in our
should fix some import errors in PySpark. We install setuptools in our
development and test dependencies so that our tests work on Python 3.12.

- Downgrading Python to 3.10 or 3.11. Pyspark officially supports these
versions of Python. So you should have better chances getting pyspark to work
- Downgrading Python to 3.10 or 3.11. PySpark officially supports these
versions of Python. So you should have better chances getting PySpark to work
well on Python 3.10 or 3.11.

### Additional Machine Learning Algorithms

hlink has optional support for two additional machine learning algorithms,
[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) and
[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). Both are
highly performant gradient-boosting libraries, each with its own
characteristics. Because these algorithms are not implemented directly in
Spark, they require some additional dependencies. To install the required
Python dependencies, run

```
pip install hlink[xgboost]
```

for XGBoost or

```
pip install hlink[lightgbm]
```

for LightGBM. If you would like to install both at once, you can run

```
pip install hlink[xgboost,lightgbm]
```

to get the Python dependencies for both. Both XGBoost and LightGBM also require
libomp, which will need to be installed separately if you don't already have it.

After installing the dependencies for one or both of these algorithms, you can
use them as model types in training and model exploration. You can read more
about these models in the hlink documentation [here](https://hlink.docs.ipums.org/models.html).

*Note: The XGBoost-PySpark integration provided by the xgboost Python package
is currently unstable, so hlink's xgboost support is experimental and may
change in the future.*

## Docs

The documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).
1 change: 1 addition & 0 deletions docs/_sources/model_exploration.md.txt
@@ -0,0 +1 @@
# Configuring Model Exploration
195 changes: 152 additions & 43 deletions docs/_sources/models.md.txt
@@ -1,53 +1,80 @@
# Models

These are models available to be used in the model evaluation, training, and household training link tasks.

* Attributes for all models:
* `threshold` -- Type: `float`. Alpha threshold (model hyperparameter).
* `threshold_ratio` -- Type: `float`. Beta threshold (de-duplication distance ratio).
* Any parameters available in the model as defined in the Spark documentation can be passed as params using the label given in the Spark docs. Commonly used parameters are listed below with descriptive explanations from the Spark docs.
These are the machine learning models available for use in the model evaluation
and training tasks and in their household counterparts.

There are a few attributes available for all models.

* `type` -- Type: `string`. The name of the model type. The available model
types are listed below.
* `threshold` -- Type: `float`. The "alpha threshold". This is the probability
score required for a potential match to be labeled a match. `0 ≤ threshold ≤
1`.
* `threshold_ratio` -- Type: `float`. The threshold ratio or "beta threshold".
This applies to records which have multiple potential matches when
`training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. For
each record, only potential matches which have the highest probability, have
a probability of at least `threshold`, *and* whose probabilities are at least
`threshold_ratio` times larger than the second-highest probability are
matches. This is sometimes called the "de-duplication distance ratio". `1 ≤
threshold_ratio < ∞`.

In addition, any model parameters documented in a model type's Spark
documentation can be passed as parameters to the model through hlink's
`training.chosen_model` and `training.model_exploration` configuration
sections.

Here is an example `training.chosen_model` configuration. The `type`,
`threshold`, and `threshold_ratio` attributes are hlink-specific. `maxDepth` is
a parameter to the random forest model which hlink passes through to the
underlying Spark classifier.

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
```
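As a rough sketch (not hlink's actual implementation), here is how the two thresholds combine under `training.decision = "drop_duplicate_with_threshold_ratio"`; the function name and signature are hypothetical, for illustration only:

```python
def is_match(probabilities: list[float], threshold: float, threshold_ratio: float) -> bool:
    """Sketch of the alpha/beta threshold rule for one record.

    `probabilities` holds the match probabilities of all potential matches
    for a single record. Returns True if the best candidate counts as a match.
    """
    best = max(probabilities)
    if best < threshold:  # alpha threshold: best match must be probable enough
        return False
    ranked = sorted(probabilities, reverse=True)
    second_best = ranked[1] if len(ranked) > 1 else 0.0
    # beta threshold: best must beat the runner-up by a wide enough margin
    return best >= threshold_ratio * second_best

# Best candidate 0.9 vs runner-up 0.5: clears both thresholds.
print(is_match([0.9, 0.5], threshold=0.2, threshold_ratio=1.2))  # True
# Runner-up too close (0.9 / 0.8 < 1.2): dropped as a duplicate.
print(is_match([0.9, 0.8], threshold=0.2, threshold_ratio=1.2))  # False
```

This is why `threshold_ratio` only matters for records with multiple potential matches: with a single candidate there is no runner-up to compare against.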

## random_forest

Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html). Returns probability as an array.
Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).
* Parameters:
* `maxDepth` -- Type: `int`. Maximum depth of the tree. Spark default value is 5.
* `numTrees` -- Type: `int`. The number of trees to train. Spark default value is 20, must be >= 1.
* `featureSubsetStrategy` -- Type: `string`. Per the Spark docs: "The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]."

```
model_parameters = {
type = "random_forest",
maxDepth = 5,
numTrees = 75,
featureSubsetStrategy = "sqrt",
threshold = 0.15,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
```

## probit

Uses [pyspark.ml.regression.GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html) with `family="binomial"` and `link="probit"`.

```
model_parameters = {
type = "probit",
threshold = 0.85,
threshold_ratio = 1.2
}
```

```toml
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
```

## logistic_regression

Uses [pyspark.ml.classification.LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html)

```
chosen_model = {
type = "logistic_regression",
threshold = 0.5,
threshold_ratio = 1.0
}
```

```toml
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
```

## decision_tree
@@ -59,13 +86,14 @@ Uses [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org
* `minInstancesPerNode` -- Type `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```
chosen_model = {
type = "decision_tree",
maxDepth = 6,
minInstancesPerNode = 2,
maxBins = 4
}
```

```toml
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
```

## gradient_boosted_trees
@@ -77,13 +105,94 @@ Uses [pyspark.ml.classification.GBTClassifier](https://spark.apache.org/docs/lat
* `minInstancesPerNode` -- Type `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
```

## xgboost

*Added in version 3.8.0.*

XGBoost is an alternative, high-performance implementation of gradient boosting.
It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier).
Since the XGBoost-PySpark integration which the xgboost Python package provides
is currently unstable, support for the xgboost model type is disabled in hlink
by default. hlink will stop with an error if you try to use this model type
without enabling support for it. To enable support for xgboost, install hlink
with the `xgboost` extra.

```
chosen_model = {
type = "gradient_boosted_trees",
maxDepth = 4,
minInstancesPerNode = 1,
maxBins = 6,
threshold = 0.7,
threshold_ratio = 1.3
}
```

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of xgboost. xgboost should raise a helpful
error if it detects that you need to install libomp.

You can view a list of xgboost's parameters
[here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
```

## lightgbm

*Added in version 3.8.0.*

LightGBM is another alternative, high-performance implementation of gradient
boosting. It uses
[synapse.ml.lightgbm.LightGBMClassifier](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
`synapse.ml` is a library which provides various integrations with PySpark,
including integrations between the C++ LightGBM library and PySpark.

LightGBM requires some additional Scala libraries that hlink does not usually
install, so support for the lightgbm model is disabled in hlink by default.
hlink will stop with an error if you try to use this model type without
enabling support for it. To enable support for lightgbm, install hlink with the
`lightgbm` extra.

```
pip install hlink[lightgbm]
```

This installs the lightgbm package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of lightgbm. If you encounter errors when
training a lightgbm model, please try installing libomp if you do not have it
installed.

lightgbm has an enormous number of available parameters. Many of these can be
set directly in hlink, via the [LightGBMClassifier
class](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
Others are available through the special `passThroughArgs` parameter, which
passes additional parameters through to the C++ library. You can see a full
list of the supported parameters
[here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

```toml
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
```
1 change: 1 addition & 0 deletions docs/column_mappings.html
@@ -402,6 +402,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparison_features.html
@@ -1301,6 +1301,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/comparisons.html
@@ -197,6 +197,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/config.html
@@ -958,6 +958,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
1 change: 1 addition & 0 deletions docs/feature_selection_transforms.html
@@ -220,6 +220,7 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>
<li class="toctree-l1"><a class="reference internal" href="pipeline_features.html">Pipeline Features</a></li>
<li class="toctree-l1"><a class="reference internal" href="substitutions.html">Substitutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="models.html">Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="model_exploration.html">Model Exploration</a></li>
</ul>

<div class="relations">
2 changes: 2 additions & 0 deletions docs/index.html
@@ -135,6 +135,8 @@ <h1>Configuration API<a class="headerlink" href="#configuration-api" title="Link
<li class="toctree-l2"><a class="reference internal" href="models.html#logistic-regression">logistic_regression</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#decision-tree">decision_tree</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#gradient-boosted-trees">gradient_boosted_trees</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#xgboost">xgboost</a></li>
<li class="toctree-l2"><a class="reference internal" href="models.html#lightgbm">lightgbm</a></li>
</ul>
</li>
</ul>