
[Autoscaler] Issue with RayCluster autoscaler and hanging non-existent tasks #48950

Open
inf000 opened this issue Nov 26, 2024 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

@inf000

inf000 commented Nov 26, 2024

What happened + What you expected to happen

Greetings,

We have been experiencing issues with RayCluster and its autoscaler. We run a RayCluster deployment on a Kubernetes cluster, using spot instances for the workers and a persistent node for the head. We always scale out from 0 workers when new tasks arrive and scale back to 0 once everything is done. This had been working for us for some time, but recently we switched from Python 3.8.9 and Ray 2.9.3 to Python 3.11.10 and Ray 2.38.0, and since then we have been hitting a really problematic issue: at random we end up with pending tasks that supposedly need to be scheduled, even though no jobs are running. This causes the autoscaler to keep the workers up even though they are doing nothing. The workers do shut down after the idle timeout, but the autoscaler provisions them again because of the stuck tasks.

Do you have any idea why this is happening, and how to fix or work around it? It is impacting our stack significantly.

I'll attach some images; as you can see, we don't have any actors and there are no unfinished jobs.

[Screenshots attached (2024-11-26): Ray dashboard views showing no running actors and no unfinished jobs]
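
For anyone digging into this, here is a minimal sketch of how the phantom pending demands could be cross-checked against the live task list. We mainly relied on the dashboard; this assumes the Ray state API in 2.38 behaves as documented, and the address below is a placeholder.

```python
# Minimal sketch, assuming the Ray state API in Ray 2.38; the address is a placeholder.
from ray.util.state import list_tasks

# List tasks that are still waiting for a node -- these are what the autoscaler
# reacts to. Since the dashboard shows no jobs or actors, this is a way to
# double-check whether any real task is behind the pending resource demands.
pending = list_tasks(
    address="auto",  # run on the head node, or point this at the cluster explicitly
    filters=[("state", "=", "PENDING_NODE_ASSIGNMENT")],
)
print(pending)
```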

Versions / Dependencies

Python 3.11.10
Ray 2.38.0
Docker image rayproject/ray:2.38.0-py311

Reproduction script

We can't pinpoint when or how the issue happens. We suspect it occurs when a job fails at some point, either during its run or as it is being instantiated.

Issue Severity

High: It blocks me from completing my task.

@Moonquakes

I have encountered the same problem, but I still can't provide a reproducible example. It may be more likely to happen when a job is stopped immediately after submitting a large number of Ray tasks, roughly as sketched below.
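
Just a sketch of the pattern I mean, not a confirmed reproduction; the task body and counts are arbitrary:

```python
import ray

ray.init()  # or connect via Ray Client, e.g. ray.init(address="ray://<head>:10001")

@ray.remote
def noop(i):
    return i

# Submit a large batch of tasks and disconnect right away, without waiting on the results.
refs = [noop.remote(i) for i in range(10_000)]
ray.shutdown()  # stopping/disconnecting immediately is what seems to make it more likely
```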

@Laleee

Laleee commented Nov 27, 2024

Same here... not yet sure how I can reproduce it. It has happened several times.

@inf000
Author

inf000 commented Nov 27, 2024

OK, a little potential progress on this: we now know how it got to those dangling tasks, and we discovered a possible issue with the runtime environment and the GCS.
Our setup is like this: we trigger a couple of jobs one after the other. Each job starts with ray.init, and we create a custom runtime environment by sending the pip dependencies plus two local packages via py_modules. These are zipped and sent to the GCS, which stores them and adds them to the URI table and the URI reference table. The job runs everything and finishes, we shut down the connection, the reference count for the URIs drops to 0, and the tables are cleared.

At some random point (it could be the second, third, or n-th job) we call ray.init, the packages are sent, and we can see from the logs that they are cached (e.g. "Runtime env py_modules gcs://_ray_pkg_53653ff9753e2d43.zip is already installed and will be reused"). The init finishes, we start pushing tasks with ray.remote, and at some point while doing that Ray throws an OSError ("OSError: Failed to download runtime_env file package gcs://_ray_pkg_53653ff9753e2d43.zip from the GCS to the Ray worker node") in the download_and_unpack_package function, specifically in the internal_kv_get call. At that point the tasks that were already scheduled are left dangling, because the job was either never created for them or was created but failed; we are not sure which.

This GCS failure is bizarre, because these jobs use the same cached packages and it's not as if the packages are being deleted: most of the time the cache hits work fine across multiple subsequent runs. The packages are small (under 1 MB), and we even tried increasing the RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S env variable, even though we know that isn't it, since our setup takes only a couple of seconds, far below the 600 s default.
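
For concreteness, each job follows roughly this pattern (a sketch only; the address, pip dependencies, and package paths below are placeholders):

```python
import ray

# Sketch of one job; address, pip deps, and local package paths are placeholders.
ray.init(
    address="ray://<head-service>:10001",
    runtime_env={
        "pip": ["numpy"],                          # the pip dependencies we send
        "py_modules": ["./pkg_one", "./pkg_two"],  # the two local packages, uploaded to the GCS
    },
)

@ray.remote
def task(x):
    return x

# The OSError about gcs://_ray_pkg_....zip surfaces while these tasks are being pushed.
results = ray.get([task.remote(i) for i in range(1000)])

ray.shutdown()  # connection closes; the URI reference counts drop back to 0
```
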
One thing we noticed by cross-referencing logs and timestamps: when runtime_env_setup gets a cache hit, the GCS server's URI table is empty, and only after a slight delay do the packages show up in the table. What seems to happen is that a worker asks for the package in the window between those two events (this is all speculation).

Hope this is of some help in reproducing or solving this issue.
