
Conversation

@gaogaotiantian
Contributor

@gaogaotiantian gaogaotiantian commented Nov 22, 2025

What changes were proposed in this pull request?

We lazy-import the worker module after fork to avoid a potential deadlock caused by importing modules that spawn multiple threads.

Why are the changes needed?

https://discuss.python.org/t/switching-default-multiprocessing-context-to-spawn-on-posix-as-well/21868

It's impossible to do a thread-safe fork in CPython. CPython started issuing warnings in 3.12 and switched the default multiprocessing start method to "spawn" in 3.14.
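As a minimal illustration of that start-method change (not Spark code), the available start methods and an explicitly requested "spawn" context can be inspected like this:

```python
import multiprocessing as mp

# "spawn" is available on every platform; "fork" only exists on POSIX.
print(mp.get_all_start_methods())

# Request "spawn" explicitly rather than relying on the platform
# default, which CPython switches to "spawn" on POSIX in 3.14.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())
```

Requesting the context explicitly keeps behavior stable across Python versions instead of inheriting whatever default the interpreter ships with.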

It would be a huge effort for us to give up fork entirely, but we can try our best not to import arbitrary modules before fork by lazy-importing the worker module after the fork.

We already have some workers that import fork-unsafe libraries like pyarrow (plan_data_source_read, for example).
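The lazy-import pattern described above can be sketched as follows. This is a minimal illustration, not Spark's actual worker code; `make_lazy_runner` is a hypothetical helper, and `base64` merely stands in for a heavy worker dependency such as pyarrow:

```python
import importlib
import sys

def make_lazy_runner(module_name, func_name):
    # Defer the import until the runner is actually called -- i.e., in
    # the child process after fork -- so a thread-spawning module is
    # never imported in the parent before the fork happens.
    def runner(*args, **kwargs):
        module = importlib.import_module(module_name)
        return getattr(module, func_name)(*args, **kwargs)
    return runner

# Using base64 as a stand-in for a heavy worker dependency:
sys.modules.pop("base64", None)
runner = make_lazy_runner("base64", "b64encode")
assert "base64" not in sys.modules   # creating the runner imports nothing
print(runner(b"spark"))              # the import happens here, on first call
assert "base64" in sys.modules
```

The key property is that merely defining or passing around the runner has no import side effects; only the first invocation, which happens on the child side of the fork, pays the import cost and its thread-creation risk.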

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI should pass.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon HyukjinKwon changed the title [SPARK-54456][Python] Import worker module after fork to avoid deadlock [SPARK-54456][PYTHON] Import worker module after fork to avoid deadlock Nov 23, 2025
Member

@HyukjinKwon HyukjinKwon left a comment


cc @ueshin
