Add the LightGBM ML library #162

riley-harper · 2024-11-19T14:49:43Z

riley-harper · 2024-11-19T15:38:02Z

We'll need to install the synapseml Python package, which you can import as synapse.ml. synapse.ml.lightgbm.LightGBMClassifier seems to be the class that we need for Spark integration. Part of the setup for SynapseML includes downloading additional Spark jars. I added a few lines to hlink.spark.session in set_conf():

         if os.path.isfile(jar_path):
             conf = conf.set("spark.jars", jar_path)
+
+        conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
+        conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
+
         return conf

     def local(self, cores=1, executor_memory="10G"):

At first this caused an error when I tried to create the Spark context. But after searching around for a solution, I cleaned out .ivy2 and .m2 in my home directory and it ran without issues. These additional configurations should probably be dependent on synapse.ml being installed, so that users who aren't using LightGBM don't have to download them.

try:
    import synapse.ml
except ModuleNotFoundError:
    _synapse_ml_available = False
else:
    _synapse_ml_available = True

...

if _synapse_ml_available:
    conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
    conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")

riley-harper · 2024-11-19T17:29:14Z

To get feature importances in training: https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#synapse.ml.lightgbm.mixin.LightGBMModelMixin.getFeatureImportances

riley-harper · 2024-11-19T19:03:51Z

I tried manually running training with LightGBM and got this:

An error occured:
ERROR type: <class 'pyspark.errors.exceptions.captured.IllegalArgumentException'>
ERROR message: Invalid slot names detected in features column: jw_interacted_namefrst_jw_imp:namelast_jw_imp
 Special characters " , : \ [ ] { } will cause unexpected behavior in LGBM unless changed. This error can be fixed by renaming the problematic columns prior to vector assembly.

Where do we construct those slot names?

Usually we don't care about the names of the vector attributes. But LightGBM uses them as feature names and disallows some characters in the names. Unfortunately, one of these characters is the colon (:), and Spark's Interaction names the output of an interaction between A and B "A:B". I looked through the Spark code and didn't see any way to configure the names of these output features. So I think the easiest way forward here is to make a transformer that renames the attributes of a vector by replacing the disallowed characters with something else.
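The core of the renaming idea can be sketched in plain Python, independent of Spark. The character set comes from the LightGBM error message quoted earlier; the function name `sanitize_slot_name` and the default replacement `_` are assumptions for illustration, not the actual transformer.

```python
# Characters that the LightGBM error message flags: " , : \ [ ] { }
SPECIAL_CHARS = '",:\\[]{}'


def sanitize_slot_name(name: str, replacement: str = "_") -> str:
    """Replace each flagged character in a slot name with `replacement`.

    Hypothetical helper for illustration; not part of hlink or SynapseML.
    """
    table = str.maketrans({char: replacement for char in SPECIAL_CHARS})
    return name.translate(table)


interacted = "jw_interacted_namefrst_jw_imp:namelast_jw_imp"
print(sanitize_slot_name(interacted))
```

In a real pipeline the same replacement would have to be applied to the vector column's attribute metadata rather than to a bare string.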

The bug was that we didn't propagate the metadata changes into Java, so they weren't persistent in something like a Pipeline. By calling withMetadata(), we should now be persisting our changes correctly.

This merges the xgboost and lightgbm branches together. There were several files with conflicts. Most of the conflicts I resolved by keeping the work from both branches.

We now compute two feature importances for each model.
- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
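To make the two definitions concrete, here is a small sketch that computes both from a list of tree splits. The input format (one `(feature, gain)` pair per split), the function name, and the feature names are assumptions for illustration; the real values come from the trained models rather than from a list like this.

```python
from collections import Counter, defaultdict


def feature_importances(splits):
    """Compute weight and gain importances from (feature, gain) pairs.

    Hypothetical helper: `splits` holds one pair per split in the model.
    """
    weight = Counter()
    gain = defaultdict(float)
    for feature, split_gain in splits:
        weight[feature] += 1          # weight: number of splits on the feature
        gain[feature] += split_gain   # gain: total gain over those splits
    return dict(weight), dict(gain)


# Made-up splits, purely for illustration.
splits = [("namelast_jw", 4.0), ("namefrst_jw", 1.5), ("namelast_jw", 2.5)]
weight, gain = feature_importances(splits)
print(weight)
print(gain)
```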

I'm still not entirely happy with this, but it's a tricky point in the code because most of the models behave one way, but xgboost and lightgbm are different. Some more refactoring might be in order.

- It turns out that multi-line TOML tables aren't allowed. So let's use the [training.chosen_model] syntax instead.
- I clarified the introductory information and made it general enough to apply to XGBoost and LightGBM as well.

The Spark Bucketizer adds commas to vector slot names, which cause problems with LightGBM later in the pipeline. This is similar to the issue with colons for Interaction, but the metadata for bucketized vectors is a little bit different. So RenameVectorAttributes needed to change a bit to handle the two different forms of metadata.
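As a rough illustration of cleaning names in more than one place, the sketch below works on a plain dict shaped like Spark ML attribute metadata: an `ml_attr` entry whose `attrs` map groups attributes by type, each with a `name` and, for nominal attributes, a list of `vals`. That layout, the sample values, and the function name `rename_attributes` are all assumptions for this example; the two forms of metadata that RenameVectorAttributes actually handles may differ.

```python
import copy


def rename_attributes(metadata: dict, old: str, new: str) -> dict:
    """Return a copy of `metadata` with `old` replaced by `new` in attribute
    names and, where present, in nominal values.

    Hypothetical helper operating on an assumed metadata layout.
    """
    result = copy.deepcopy(metadata)
    attrs_by_type = result.get("ml_attr", {}).get("attrs", {})
    for attrs in attrs_by_type.values():
        for attr in attrs:
            if "name" in attr:
                attr["name"] = attr["name"].replace(old, new)
            if "vals" in attr:
                attr["vals"] = [val.replace(old, new) for val in attr["vals"]]
    return result


# Made-up metadata with one numeric and one nominal attribute.
example = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "A:B"}],
            "nominal": [{"idx": 1, "name": "bucket", "vals": ["0.0, 1.0", "1.0, 2.0"]}],
        },
        "num_attrs": 2,
    }
}
no_commas = rename_attributes(example, ",", "")
print(no_commas["ml_attr"]["attrs"]["nominal"][0]["vals"])
```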

riley-harper added the enhancement label Nov 19, 2024

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Create a lightgbm hlink extra

59033b2

This installs SynapseML, Microsoft's Apache Spark integrations package. It has a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Create a test for choose_classifier() support for lightgbm

88956ec

This is currently failing.

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Allow model_type lightgbm in choose_classifier()

dcafbc0

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Fix a flake8 error

83f6b5c

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Run CI/CD once with lightgbm and once without

1aef721

riley-harper added a commit that referenced this issue Nov 19, 2024

[#162] Add two training tests for lightgbm

72fd83c

One of these is failing because there's a bug where LightGBM throws an error on interacted features.

riley-harper added a commit that referenced this issue Nov 20, 2024

[#162] Implement basic RenameVectorAttributes logic

7c34bab

riley-harper added a commit that referenced this issue Nov 20, 2024

[#162] Implement RenameVectorAttributes and make it more flexible via…

a4f3534

… some extra params

riley-harper added a commit that referenced this issue Nov 20, 2024

[#162] Integrate RenameVectorAttributes to remove colons from Interac…

8150ee5

…tion output for LightGBM

riley-harper added a commit that referenced this issue Nov 20, 2024

[#162] Add an integration test for matching with LightGBM, and set th…

b2dfa4e

…e post-transformer

riley-harper added a commit that referenced this issue Nov 20, 2024

[#162] Add hlink notice to the top of new files, add logging to Renam…

444c6a7

…eVectorAttributes

riley-harper added a commit that referenced this issue Nov 21, 2024

[#162] Integrate LightGBM with training step 3

7f7afe7

riley-harper added a commit that referenced this issue Nov 21, 2024

[#161, #162] Rename some variables and add logging in training step 3

7864432

riley-harper added a commit that referenced this issue Nov 22, 2024

[#162] Swap to using ADD JAR instead of spark.jars.packages

7dcc81d

This should hopefully let executors find the jars as well as the driver. I've added some comments because this is a bit gnarly.

riley-harper added a commit that referenced this issue Nov 22, 2024

[#162] Add lightgbm docs to sphinx-docs/models.md

6ae1f4e

riley-harper added a commit that referenced this issue Nov 22, 2024

[#161, #162] Match up the "missing module" error messages for xgboost…

987f71c

… and lightgbm

riley-harper added a commit that referenced this issue Nov 25, 2024

[#161, #162] Update the README with docs on xgboost and lightgbm

aeaef93

riley-harper mentioned this issue Nov 25, 2024

Add support for XGBoost and LightGBM #165

Merged

3 tasks

riley-harper added a commit that referenced this issue Dec 3, 2024

[#162] Require lightgbm for a new test, remove debugging output

c5bf26e

Generally clean up some small mistakes. I also added a comment to the logic that removes the commas in core/pipeline.py.

riley-harper added the type: feature label and removed the enhancement label Dec 4, 2024

riley-harper closed this as completed in #165 Dec 4, 2024
