Add the LightGBM ML library #162

Closed
9 of 10 tasks
riley-harper opened this issue Nov 19, 2024 · 3 comments · Fixed by #165
Labels
type: feature A new feature or enhancement to a feature

Comments

@riley-harper
Contributor

riley-harper commented Nov 19, 2024

In addition to XGBoost (#161), we would also like to add support for LightGBM. This should work similarly to XGBoost, since we'd also like to make LightGBM opt-in. From the documentation, it sounds like we'll need the SynapseML package to be able to run LightGBM on Spark.

To Do List

  • Create an hlink extra which installs LightGBM and its dependencies
  • Integrate LightGBM into core/classifier.py
  • Make sure LightGBM works with model_exploration
  • Make sure LightGBM works with training (especially step 3, save model metadata)
  • Make sure LightGBM works with matching
  • Document the new pip extra and any additional requirements for LightGBM (LightGBM also requires libomp)
  • Document the new model type in the Sphinx docs
  • Make a nice, informative error message for when LightGBM isn't available
  • Update the tests to install LightGBM for one of the matrix entries and not the other
  • Try to make Spark print less about the synapseml installation (can we silence this or send it to the log instead of the screen?)
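
For the "informative error message" item, one possible shape is sketched below. This is a minimal sketch under assumptions, not hlink's implementation: the helper name `require_optional_module` and the extra name `lightgbm` are hypothetical, since the issue does not name the extra.

```python
import importlib


def require_optional_module(module_name: str, extra: str, package: str = "hlink"):
    """Import an optional dependency, or raise an error that says how to get it.

    The function name and the `extra` / `package` values are hypothetical;
    they only illustrate the shape of the message.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"The '{module_name}' module is required for this model type but is "
            f"not installed. Try installing it with: "
            f"pip install '{package}[{extra}]'"
        ) from exc
```

Calling `require_optional_module("synapse.ml", "lightgbm")` at the point where the model is constructed would keep the import cost and the error away from users who never select LightGBM.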
@riley-harper
Contributor Author

riley-harper commented Nov 19, 2024

We'll need to install the synapseml Python package, which you can import as synapse.ml. synapse.ml.lightgbm.LightGBMClassifier seems to be the class that we need for Spark integration. Part of the setup for SynapseML includes downloading additional Spark jars. I added a few lines to hlink.spark.session in set_conf():

         if os.path.isfile(jar_path):
             conf = conf.set("spark.jars", jar_path)
+
+        conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
+        conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
+
         return conf

     def local(self, cores=1, executor_memory="10G"):

At first this caused an error when I tried to create the Spark context. But after searching around for a solution, I cleaned out .ivy2 and .m2 in my home directory and it ran without issues. These additional configurations should probably be dependent on synapse.ml being installed, so that users who aren't using LightGBM don't have to download them.

try:
    import synapse.ml
except ModuleNotFoundError:
    _synapse_ml_available = False
else:
    _synapse_ml_available = True

...

if _synapse_ml_available:
    conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
    conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
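
The same availability check can also be done without importing the package, by asking the import system whether it can locate it. The following is a minimal sketch: `module_available` and `synapseml_spark_settings` are hypothetical names, and the settings are returned as a plain dict so that the logic can be exercised without a Spark installation.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system.

    For a dotted name, find_spec() imports the parent package, so a missing
    parent surfaces as ModuleNotFoundError, which we treat as "not available".
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def synapseml_spark_settings(available: bool) -> dict[str, str]:
    """Extra Spark settings for SynapseML, or nothing if it isn't installed."""
    if not available:
        return {}
    return {
        "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.8",
        "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
    }


settings = synapseml_spark_settings(module_available("synapse.ml"))
for key, value in settings.items():
    print(f"{key} = {value}")
```

Each key and value pair could then be passed to `conf.set(key, value)` inside `set_conf()`.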

@riley-harper
Contributor Author

riley-harper added a commit that referenced this issue Nov 19, 2024
This installs SynapseML, Microsoft's Apache Spark integrations package. It has
a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.
riley-harper added a commit that referenced this issue Nov 19, 2024
riley-harper added a commit that referenced this issue Nov 19, 2024
@riley-harper
Contributor Author

I tried manually running training with LightGBM and got this:

An error occured:
ERROR type: <class 'pyspark.errors.exceptions.captured.IllegalArgumentException'>
ERROR message: Invalid slot names detected in features column: jw_interacted_namefrst_jw_imp:namelast_jw_imp
 Special characters " , : \ [ ] { } will cause unexpected behavior in LGBM unless changed. This error can be fixed by renaming the problematic columns prior to vector assembly.

Where do we construct those slot names?

riley-harper added a commit that referenced this issue Nov 19, 2024
One of these is failing because there's a bug where LightGBM throws an error on
interacted features.
riley-harper added a commit that referenced this issue Nov 20, 2024
Usually we don't care about the names of the vector attributes. But LightGBM
uses them as feature names and disallows some characters in the names.
Unfortunately, one of these characters is :, and Spark's Interaction names the
output of an interaction between A and B "A:B". I looked through the Spark code
and didn't see any way to configure the names of these output features. So I
think the easiest way forward here is to make a transformer that renames the
attributes of a vector, replacing the disallowed characters with an allowed
one.
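
The core of such a renaming step is a character replacement on each slot name. A minimal sketch in plain Python follows, assuming the forbidden set is exactly the characters listed in the LightGBM error message and that an underscore is an acceptable replacement; the function name is illustrative and not taken from hlink.

```python
# Characters that the LightGBM error message lists as problematic.
FORBIDDEN_CHARS = '",:\\[]{}'


def sanitize_slot_name(name: str, replacement: str = "_") -> str:
    """Replace each forbidden character in a slot name with `replacement`."""
    table = str.maketrans({char: replacement for char in FORBIDDEN_CHARS})
    return name.translate(table)


print(sanitize_slot_name("jw_interacted_namefrst_jw_imp:namelast_jw_imp"))
# prints: jw_interacted_namefrst_jw_imp_namelast_jw_imp
```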
riley-harper added a commit that referenced this issue Nov 20, 2024
The bug was that we didn't propagate the metadata changes into Java, so they
weren't persistent in something like a Pipeline. By calling withMetadata(), we
should now be persisting our changes correctly.
riley-harper added a commit that referenced this issue Nov 20, 2024
riley-harper added a commit that referenced this issue Nov 21, 2024
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
riley-harper added a commit that referenced this issue Nov 21, 2024
We now compute two feature importances for each model.

- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
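
To make the two definitions concrete, here is a small sketch that computes both importances from a flat list of splits. The `(feature_name, gain)` input format and the feature names are made up for illustration; the real libraries expose these numbers through their own APIs.

```python
from collections import defaultdict


def feature_importances(splits):
    """Compute (weight, gain) importances from (feature_name, gain) splits."""
    weight = defaultdict(int)   # number of splits per feature
    gain = defaultdict(float)   # total gain per feature
    for feature, split_gain in splits:
        weight[feature] += 1
        gain[feature] += split_gain
    return dict(weight), dict(gain)


splits = [("namefrst_jw", 4.0), ("namelast_jw", 2.5), ("namefrst_jw", 1.5)]
weight, gain = feature_importances(splits)
print(weight)  # {'namefrst_jw': 2, 'namelast_jw': 1}
print(gain)    # {'namefrst_jw': 5.5, 'namelast_jw': 2.5}
```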
riley-harper added a commit that referenced this issue Nov 21, 2024
I'm still not entirely happy with this, but it's a tricky point in the code
because most of the models behave one way, but xgboost and lightgbm are
different. Some more refactoring might be in order.
riley-harper added a commit that referenced this issue Nov 22, 2024
This should hopefully let executors find the jars as well as the driver. I've
added some comments because this is a bit gnarly.
riley-harper added a commit that referenced this issue Nov 25, 2024
- It turns out that multi-line inline tables aren't allowed in TOML. So let's
  use the [training.chosen_model] syntax instead.
- I clarified the introductory information and made it general enough to apply
  to XGBoost and LightGBM as well.
riley-harper added a commit that referenced this issue Dec 3, 2024
The Spark Bucketizer adds commas to vector slot names, which cause
problems with LightGBM later in the pipeline. This is similar to the
issue with colons for Interaction, but the metadata for bucketized
vectors is a little bit different. So RenameVectorAttributes needed to
change a bit to handle the two different forms of metadata.
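
A rough sketch of handling two metadata forms is shown below on plain dictionaries. It assumes the vector metadata follows Spark's `ml_attr` layout, where every attribute may have a `name` and nominal attributes may also carry a `vals` list (which is where bucket labels containing commas would live). The function name and the example metadata are illustrative and are not taken from hlink's RenameVectorAttributes.

```python
import copy

FORBIDDEN_CHARS = '",:\\[]{}'
_TABLE = str.maketrans({char: "_" for char in FORBIDDEN_CHARS})


def sanitize_ml_attr(metadata: dict) -> dict:
    """Return a copy of `metadata` with forbidden characters replaced.

    Handles attribute entries that only have a "name" as well as entries
    that also list their values under "vals".
    """
    cleaned = copy.deepcopy(metadata)
    attrs = cleaned.get("ml_attr", {}).get("attrs", {})
    for attr_list in attrs.values():  # e.g. "numeric", "nominal", "binary"
        for attr in attr_list:
            if "name" in attr:
                attr["name"] = attr["name"].translate(_TABLE)
            if "vals" in attr:
                attr["vals"] = [val.translate(_TABLE) for val in attr["vals"]]
    return cleaned


# Hypothetical metadata: one interacted slot and one bucketized slot.
example = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "a:b"}],
            "nominal": [
                {"idx": 1, "name": "bucket", "vals": ["0.0, 10.0", "10.0, 20.0"]}
            ],
        },
        "num_attrs": 2,
    }
}
print(sanitize_ml_attr(example)["ml_attr"]["attrs"])
```

Working on a deep copy keeps the input metadata untouched, which makes the function easy to test in isolation before wiring it into a transformer.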
riley-harper added a commit that referenced this issue Dec 3, 2024
Generally clean up some small mistakes. I also added a comment to the
logic that removes the commas in core/pipeline.py.
riley-harper added the "type: feature" label (A new feature or enhancement to a feature) and removed the "enhancement" label on Dec 4, 2024