Add the LightGBM ML library #162
Comments
We'll need to install the
At first this caused an error when I tried to create the Spark context. But after searching around for a solution, I cleaned out .ivy2 and .m2 in my home directory and it ran without issues. These additional configurations should probably be dependent on
To get feature importances in training: https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#synapse.ml.lightgbm.mixin.LightGBMModelMixin.getFeatureImportances
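As a rough sketch of how those importances could be made readable: assuming getFeatureImportances() returns one value per vector slot, in slot order, they can be paired with the slot names and sorted. The helper name rank_feature_importances and the example feature names are hypothetical, not something that exists in this project.

```python
def rank_feature_importances(feature_names, importances):
    """Pair each feature name with its importance, most important first.

    Assumes `importances` has one entry per vector slot, in slot order.
    """
    if len(feature_names) != len(importances):
        raise ValueError(
            f"got {len(feature_names)} names but {len(importances)} importances"
        )
    pairs = zip(feature_names, importances)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical values standing in for model.getFeatureImportances().
ranked = rank_feature_importances(["age", "sex", "namelast_jw"], [1.0, 3.0, 2.0])
for name, importance in ranked:
    print(f"{name}: {importance}")
```

With a trained SynapseML model, the second argument would presumably come from model.getFeatureImportances(), and the first from wherever the slot names are constructed.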
This installs SynapseML, Microsoft's Apache Spark integrations package. It has a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.
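For reference, here is a minimal sketch of the Spark settings involved, assuming the Maven coordinates and repository URL from the SynapseML installation docs. The version 1.0.8 is taken from the docs link in this thread and the Scala version 2.12 is an assumption; the function names and the enable_lightgbm flag are hypothetical, not the project's actual API.

```python
def synapseml_spark_conf(version="1.0.8", scala_version="2.12"):
    """Return the Spark settings that pull the SynapseML jars from Maven."""
    return {
        "spark.jars.packages": (
            f"com.microsoft.azure:synapseml_{scala_version}:{version}"
        ),
        "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
    }


def build_session(app_name="lightgbm-example", enable_lightgbm=False):
    """Create a SparkSession, adding the SynapseML settings only on request."""
    # Imported lazily so the config helper above works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    if enable_lightgbm:
        for key, value in synapseml_spark_conf().items():
            builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the extra settings behind a flag means a session created without LightGBM support never tries to download the SynapseML jars.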
I tried manually running training with LightGBM and got this:
Where do we construct those slot names?
One of these is failing because there's a bug where LightGBM throws an error on interacted features.
Usually we don't care about the names of the vector attributes. But LightGBM uses them as feature names and disallows some characters in those names. Unfortunately, one of these characters is :, and Spark's Interaction names the output of an interaction between A and B "A:B". I looked through the Spark code and didn't see any way to configure the names of these output features. So I think the easiest way forward here is to make a transformer that renames the attributes of a vector by replacing the disallowed characters with another character.
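To illustrate the renaming idea, here is a minimal pure-Python sketch that works on the metadata dictionary of a vector column. It assumes the usual Spark ML layout, where attribute names live under ml_attr -> attrs -> (attribute type) -> name. The function name and its parameters are hypothetical and are not the actual transformer.

```python
import copy


def rename_vector_attributes(metadata, strings_to_replace=(":",), replace_with="_"):
    """Return a copy of `metadata` with the given strings replaced in attribute names.

    Assumes the Spark ML layout: metadata["ml_attr"]["attrs"][<type>] is a
    list of attribute dicts, each of which may have a "name" key.
    """
    renamed = copy.deepcopy(metadata)  # leave the caller's dict untouched
    attr_groups = renamed.get("ml_attr", {}).get("attrs", {})
    for attrs in attr_groups.values():
        for attr in attrs:
            name = attr.get("name")
            if name is None:
                continue
            for old in strings_to_replace:
                name = name.replace(old, replace_with)
            attr["name"] = name
    return renamed


# Example metadata shaped like the output of an interaction between A and B.
metadata = {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "A:B"}]}, "num_attrs": 1}}
print(rename_vector_attributes(metadata))
```

In a real transformer the resulting dictionary would still have to be attached back to the DataFrame column. DataFrame.withMetadata (available from PySpark 3.3) is one way to do that.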
The bug was that we didn't propagate the metadata changes into Java, so they weren't persistent in something like a Pipeline. By calling withMetadata(), we should now be persisting our changes correctly.
This should hopefully let executors find the jars as well as the driver. I've added some comments because this is a bit gnarly.
The Spark Bucketizer adds commas to vector slot names, which cause problems with LightGBM later in the pipeline. This is similar to the issue with colons for Interaction, but the metadata for bucketized vectors is a little bit different. So RenameVectorAttributes needed to change a bit to handle the two different forms of metadata.
Generally clean up some small mistakes. I also added a comment to the logic that removes the commas in core/pipeline.py.
In addition to XGBoost (#161), we would also like to add support for LightGBM. This should work similarly to XGBoost, since we'd also like to make LightGBM opt-in. From the documentation, it sounds like we'll need the SynapseML package to be able to run LightGBM on Spark.
To Do List