[BUG] pyiceberg hanging on multiprocessing #1488
Hi @frankliee, thanks for reporting this issue. I noticed you're using version 0.7.1; the latest version is 0.8.1. Could you retry with the latest version? The issue might be due to the ability to pickle the Table object, #513.
It would also be helpful if you have a stack trace showing where the hang happens.
@kevinjqliu
I suspect this is due to how we cache the filesystem (iceberg-python/pyiceberg/io/pyarrow.py, lines 335 to 340 at e39f91a). I don't think this is safe in a multi-process environment. In the second example, see iceberg-python/pyiceberg/catalog/__init__.py, lines 716 to 717 at e39f91a.
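To illustrate the caching concern, here is a minimal self-contained sketch (hypothetical stand-in types, not pyiceberg's actual code): an `lru_cache`d factory hands every caller the same object for the lifetime of the process, so a fork-started child inherits that same object in whatever state it was in at fork time.

```python
from functools import lru_cache

class FakeFileSystem:
    """Hypothetical stand-in for a cached filesystem/FileIO object."""
    def __init__(self, scheme: str):
        self.scheme = scheme

@lru_cache
def get_fs(scheme: str) -> FakeFileSystem:
    # One instance per scheme for the whole process; any locks or open
    # sockets held inside it are duplicated into fork-started children.
    return FakeFileSystem(scheme)

a = get_fs("s3")
b = get_fs("s3")
print(a is b)  # True: both callers share one cached instance
```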
One thing we can test is to force-create a new FileIO in the worker. Can you try this out?
I have tried re-creating the PyArrowFileIO in the worker subprocess, but it is still blocked in the same place. By the way, I added some logging to confirm this.
I ran `strace` on the worker process. Then I found that using the "spawn" start method avoids the hang:

```python
import multiprocessing as mp

from pyiceberg.io.pyarrow import PyArrowFileIO

worker_num = 2

def worker(tbl):
    tbl.io = PyArrowFileIO(tbl.properties)
    arr = tbl.scan().to_arrow()
    print(arr)

if __name__ == "__main__":
    from pyiceberg.catalog import load_catalog

    ctx = mp.get_context("spawn")
    catalog = load_catalog("mycatalog")
    tbl = catalog.load_table("db.table")
    workers = [ctx.Process(target=worker, args=(tbl,)) for _ in range(worker_num)]
    [p.start() for p in workers]
    [p.join() for p in workers]
```
Glad you were able to find a working solution. In general, the FileIO is loaded/retrieved from the catalog object. Arrow has a known issue related to this, and the multiprocessing docs also warn that safely forking a multithreaded process is problematic. It sounds like that is what happens here.
Apache Iceberg version
0.7.1
Please describe the bug 🐞
The bad code: load the table in the subprocess. The `arr` will never be printed because the process hangs.
The good code: load the table in the main process.
The only difference is where the table is loaded.
Willingness to contribute