Skip to content

use cloudpickle to serialize expressions held in an estimator + skip eager pickling checks #1324

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jeromedockes
Copy link
Member

one possible fix for #1323

@@ -420,7 +420,8 @@ def __get__(self, instance, owner=None):
return f"""Skrub expression.\nDocstring of the preview:\n{doc}"""


class Expr:
class Expr(_CloudPickle):
_cloudpickle_attributes = ["_skrub_impl"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is _skrub_impl guaranteed to not hold a reference to a large datastructure? Calling cloudpickle.dumps would be problematic otherwise (by allocating a large intermediate datastructure before serializing to disk/network).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, that's true. no it isn't and in fact it is quite likely to hold references to dataframes that users passed when initializing a variable

all in all I think we might better off removing the pickle-ability checks, and any attempt to force usage of cloudpickle, and just recommend in the documentation that users rely on cloudpickle if pickle cannot handle their pipeline (eg due to lambda expressions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe we could have that strategy on the estimators returned by get_estimator, get_grid_search etc instead because those don't have references to data, and they are more likely to be what one wants to serialize anyway after fitting

@jeromedockes jeromedockes changed the title use cloudpickle to serialize expressions' internal data use cloudpickle to serialize expressions held in an estimator Apr 15, 2025
@jeromedockes jeromedockes added no changelog needed expressions Something related to the skrub expressions labels Apr 15, 2025
@jeromedockes jeromedockes marked this pull request as ready for review April 15, 2025 15:33
@jeromedockes jeromedockes changed the title use cloudpickle to serialize expressions held in an estimator use cloudpickle to serialize expressions held in an estimator + skip eager pickling checks Apr 15, 2025
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding of the convo and the current limitations of the pickling mechanism, it LGTM –I haven't tested thoroughly, yet.

@Vincent-Maladiere Vincent-Maladiere merged commit 5f68b95 into skrub-data:main Apr 16, 2025
27 of 29 checks passed
@jeromedockes jeromedockes deleted the cloudpickle-expressions branch April 16, 2025 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Something related to the skrub expressions no changelog needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants