
Test fails with Dask 2024.11.0+ #10994

Open
hcho3 opened this issue Nov 12, 2024 · 4 comments

hcho3 (Collaborator) commented Nov 12, 2024

https://github.com/dmlc/xgboost/actions/runs/11753771153/job/32747003155

E                       distributed.client.FutureCancelledError: ('_argmax-06657a445bd2e0d811c6ff48d5860817', 24) cancelled for reason: unknown.

../../mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/client.py:2427: FutureCancelledError
-------------------------------------------------- Captured log setup --------------------------------------------------
INFO     distributed.http.proxy:proxy.py:85 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO     distributed.scheduler:scheduler.py:1755 State start
INFO     distributed.scheduler:scheduler.py:4225   Scheduler at:     tcp://127.0.0.1:45365
INFO     distributed.scheduler:scheduler.py:4240   dashboard at:  http://127.0.0.1:35259/status
INFO     distributed.scheduler:scheduler.py:8115 Registering Worker plugin shuffle
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44537'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:32881'
INFO     distributed.scheduler:scheduler.py:4579 Register worker <WorkerState 'tcp://127.0.0.1:41799', name: 0, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:6171 Starting worker compute stream, tcp://127.0.0.1:41799
INFO     distributed.core:core.py:883 Starting established connection to tcp://127.0.0.1:51116
INFO     distributed.scheduler:scheduler.py:4579 Register worker <WorkerState 'tcp://127.0.0.1:42755', name: 1, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:6171 Starting worker compute stream, tcp://127.0.0.1:42755
INFO     distributed.core:core.py:883 Starting established connection to tcp://127.0.0.1:51122
INFO     distributed.scheduler:scheduler.py:5925 Receive client connection: Client-1b5e6d99-a113-11ef-bd45-00c0cab020ae
INFO     distributed.core:core.py:883 Starting established connection to tcp://127.0.0.1:51132
-------------------------------------------------- Captured log call ---------------------------------------------------
INFO     distributed.worker:worker.py:3171 Run out-of-band function '_start_tracker'
INFO     distributed.scheduler:scheduler.py:5925 Receive client connection: Client-worker-1c041fdc-a113-11ef-bd59-00c0cab020ae
INFO     distributed.core:core.py:883 Starting established connection to tcp://127.0.0.1:51160
INFO     distributed.scheduler:scheduler.py:5925 Receive client connection: Client-worker-1c049a95-a113-11ef-bd5c-00c0cab020ae
INFO     distributed.core:core.py:883 Starting established connection to tcp://127.0.0.1:51200
ERROR    distributed.scheduler:scheduler.py:4956 No keys provided
ERROR    distributed.scheduler:scheduler.py:4956 No keys provided
ERROR    distributed.scheduler:scheduler.py:4956 No keys provided
INFO     distributed.scheduler:scheduler.py:4637 User asked for computation on lost data, ('_argmax-06657a445bd2e0d811c6ff48d5860817', 24)
ERROR    distributed.scheduler:scheduler.py:2091 Error transitioning 'Booster-86632f99bd6a485787c42e9e625f998e' from 'waiting' to 'processing'
Traceback (most recent call last):
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 2010, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 2461, in _transition_waiting_processing
    return self._add_to_processing(ts, ws, stimulus_id=stimulus_id)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 3407, in _add_to_processing
    return {}, {}, {ws.address: [self._task_to_msg(ts)]}
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 3579, in _task_to_msg
    assert ts.priority, ts
AssertionError: <TaskState 'Booster-86632f99bd6a485787c42e9e625f998e' processing>
ERROR    distributed.scheduler:scheduler.py:4956 <TaskState 'Booster-86632f99bd6a485787c42e9e625f998e' processing>
ERROR    distributed.protocol.pickle:pickle.py:79 Failed to serialize <TaskState 'Booster-86632f99bd6a485787c42e9e625f998e' processing>.
Traceback (most recent call last):
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 4911, in update_graph
    metrics = self._create_taskstate_from_graph(
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 4788, in _create_taskstate_from_graph
    self.transitions(recommendations, stimulus_id)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 8227, in transitions
    self._transitions(recommendations, client_msgs, worker_msgs, stimulus_id)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 2127, in _transitions
    new_recs, new_cmsgs, new_wmsgs = self._transition(key, finish, stimulus_id)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 2010, in _transition
    recommendations, client_msgs, worker_msgs = func(
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 2461, in _transition_waiting_processing
    return self._add_to_processing(ts, ws, stimulus_id=stimulus_id)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 3407, in _add_to_processing
    return {}, {}, {ws.address: [self._task_to_msg(ts)]}
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/scheduler.py", line 3579, in _task_to_msg
    assert ts.priority, ts
AssertionError: <TaskState 'Booster-86632f99bd6a485787c42e9e625f998e' processing>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 60, in dumps
    result = pickle.dumps(x, **dump_kwargs)
AttributeError: Can't pickle local object '_inplace_predict_async.<locals>.mapped_predict'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 65, in dumps
    pickler.dump(x)
AttributeError: Can't pickle local object '_inplace_predict_async.<locals>.mapped_predict'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 77, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1529, in dumps
    cp.dump(obj)
  File "/home/phcho/mambaforge/envs/linux_cpu_test/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1295, in dump
    return super().dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
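
The bottom of the traceback shows two distinct serialization failures stacked on top of each other: plain `pickle` refuses any function defined inside another function (hence `Can't pickle local object '_inplace_predict_async.<locals>.mapped_predict'`), and the `cloudpickle` fallback then fails because the closure captures a weak reference, and `weakref.ReferenceType` objects are not picklable. A minimal sketch of both failure modes — the `Booster` stand-in and the `make_predictor`/`mapped_predict` pair below are hypothetical and only mirror the names in the traceback:

```python
import pickle
import weakref


class Booster:
    """Stand-in class; any ordinary object works as a weakref target."""


def make_predictor():
    booster = Booster()
    ref = weakref.ref(booster)  # weak reference captured by the closure

    def mapped_predict(x):
        return ref(), x

    return mapped_predict


fn = make_predictor()

# Failure mode 1: plain pickle rejects local (nested) functions outright,
# because it serializes functions by qualified name.
try:
    pickle.dumps(fn)
except AttributeError as e:
    print("pickle:", e)

# Failure mode 2: even serializers that handle closures (e.g. cloudpickle)
# fail on a captured weakref, since weakref.ReferenceType has no pickle
# support at all.
try:
    pickle.dumps(weakref.ref(Booster()))
except TypeError as e:
    print("weakref:", e)
```

In the log above, the scheduler hit these limits while trying to serialize the offending `TaskState` for its error report ("Failed to serialize <TaskState ...>"), which is why the original `AssertionError` ends up buried under pickling errors.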
trivialfis (Member) commented

Looking at the error, I'm not sure how to create a minimal example for Dask to debug this.

hcho3 (Collaborator, Author) commented Nov 13, 2024

Were you able to reproduce the failure in the XGBoost pytest?

trivialfis (Member) commented

Yes, I can reproduce it on my local machine with the latest dask/distributed by running only the classification tests:

$ pytest -s -v ./tests/test_distributed/test_gpu_with_dask/ -k test_dask_classifier

trivialfis (Member) commented

Dask has been getting flakier with the new dask-expr and the new shuffle engine; it might take some time to debug these failures.
