-
Notifications
You must be signed in to change notification settings - Fork 120
use cloudpickle to serialize expressions held in an estimator + skip eager pickling checks #1324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use cloudpickle to serialize expressions held in an estimator + skip eager pickling checks #1324
Conversation
skrub/_expressions/_expressions.py
Outdated
@@ -420,7 +420,8 @@ def __get__(self, instance, owner=None): | |||
return f"""Skrub expression.\nDocstring of the preview:\n{doc}""" | |||
|
|||
|
|||
class Expr: | |||
class Expr(_CloudPickle): | |||
_cloudpickle_attributes = ["_skrub_impl"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is _skrub_impl
guaranteed to not hold a reference to a large datastructure? Calling cloudpickle.dumps
would be problematic otherwise (by allocating a large intermediate datastructure before serializing to disk/network).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, that's true. no it isn't and in fact it is quite likely to hold references to dataframes that users passed when initializing a variable
all in all I think we might better off removing the pickle-ability checks, and any attempt to force usage of cloudpickle, and just recommend in the documentation that users rely on cloudpickle if pickle cannot handle their pipeline (eg due to lambda expressions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe we could have that strategy on the estimators returned by get_estimator
, get_grid_search
etc instead because those don't have references to data, and they are more likely to be what one wants to serialize anyway after fitting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding of the convo and the current limitations of the pickling mechanism, it LGTM –I haven't tested thoroughly, yet.
one possible fix for #1323