
[Autoscaler] Issue with RayCluster autoscaler and hanging non-existent tasks #48950

Open
inf000 opened this issue Nov 26, 2024 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

@inf000

inf000 commented Nov 26, 2024

What happened + What you expected to happen

Greetings,

We have been experiencing issues with RayCluster and its autoscaler. We run a RayCluster deployment on a Kubernetes cluster, using spot instances for the workers and a persistent node for the head. We always scale out from 0 workers when new tasks arrive and scale back to 0 once everything is done. This had been working for us for some time, but recently we switched from Python 3.8.9 and Ray 2.9.3 to Python 3.11.10 and Ray 2.38.0, and since then we have been hitting a really problematic issue: at random we end up with pending tasks that supposedly need to be scheduled, even though no jobs are running. This causes the autoscaler to keep the workers up even though they are doing nothing. The workers do shut down after the idle timeout, but the autoscaler provisions them again because of the stuck tasks.

Do you have any idea why this is happening, and how to fix or work around it? It is impacting our stack significantly.

I'll attach some images; as you can see, we don't have any actors and there are no unfinished jobs.

[Screenshots attached (2024-11-26): Ray dashboard views showing no running actors and no unfinished jobs]
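
For anyone digging into this, here is a minimal sketch of how the phantom pending demands could be cross-checked against the live task list. We mainly relied on the dashboard; this assumes the Ray state API in 2.38 behaves as documented, and the address below is a placeholder.

```python
# Minimal sketch, assuming the Ray state API in Ray 2.38; the address is a placeholder.
from ray.util.state import list_tasks

# List tasks that are still waiting for a node -- these are what the autoscaler
# reacts to. Since the dashboard shows no jobs or actors, this is a way to
# double-check whether any real task is behind the pending resource demands.
pending = list_tasks(
    address="auto",  # run on the head node, or point this at the cluster explicitly
    filters=[("state", "=", "PENDING_NODE_ASSIGNMENT")],
)
print(pending)
```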

Versions / Dependencies

Python 3.11.10
Ray 2.38.0
Docker image rayproject/ray:2.38.0-py311

Reproduction script

We can't pinpoint when or how the issue happens. We suspect it occurs when a job fails at some point, either during its run or as it is being instantiated.

Issue Severity

High: It blocks me from completing my task.

@Moonquakes

I have encountered the same problem, but I still can't provide a reproducible example. It may be more likely to happen when a job is stopped immediately after submitting a large number of Ray tasks, roughly as sketched below.
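
Just a sketch of the pattern I mean, not a confirmed reproduction; the task body and counts are arbitrary:

```python
import ray

ray.init()  # or connect via Ray Client, e.g. ray.init(address="ray://<head>:10001")

@ray.remote
def noop(i):
    return i

# Submit a large batch of tasks and disconnect right away, without waiting on the results.
refs = [noop.remote(i) for i in range(10_000)]
ray.shutdown()  # stopping/disconnecting immediately is what seems to make it more likely
```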

@Laleee

Laleee commented Nov 27, 2024

Same here... not yet sure how I can reproduce it. It has happened several times.

@inf000
Author

inf000 commented Nov 27, 2024

OK, a little potential progress on this: we now know how it got to those dangling tasks, and we discovered a possible issue with the runtime environment and the GCS.
Our setup is like this: we trigger a couple of jobs one after the other. Each job starts with ray.init, and we create a custom runtime environment by sending the pip dependencies plus two local packages via py_modules. These are zipped and sent to the GCS, which stores them and adds them to the URI table and the URI reference table. The job runs everything and finishes, we shut down the connection, the reference count for the URIs drops to 0, and the tables are cleared.

At some random point (it could be the second, third, or n-th job) we call ray.init, the packages are sent, and we can see from the logs that they are cached (e.g. "Runtime env py_modules gcs://_ray_pkg_53653ff9753e2d43.zip is already installed and will be reused"). The init finishes, we start pushing tasks with ray.remote, and at some point while doing that Ray throws an OSError ("OSError: Failed to download runtime_env file package gcs://_ray_pkg_53653ff9753e2d43.zip from the GCS to the Ray worker node") in the download_and_unpack_package function, specifically in the internal_kv_get call. At that point the tasks that were already scheduled are left dangling, because the job was either never created for them or was created but failed; we are not sure which.

This GCS failure is bizarre, because these jobs use the same cached packages and it's not as if the packages are being deleted: most of the time the cache hits work fine across multiple subsequent runs. The packages are small (under 1 MB), and we even tried increasing the RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S env variable, even though we know that isn't it, since our setup takes only a couple of seconds, far below the 600 s default.
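
For concreteness, each job follows roughly this pattern (a sketch only; the address, pip dependencies, and package paths below are placeholders):

```python
import ray

# Sketch of one job; address, pip deps, and local package paths are placeholders.
ray.init(
    address="ray://<head-service>:10001",
    runtime_env={
        "pip": ["numpy"],                          # the pip dependencies we send
        "py_modules": ["./pkg_one", "./pkg_two"],  # the two local packages, uploaded to the GCS
    },
)

@ray.remote
def task(x):
    return x

# The OSError about gcs://_ray_pkg_....zip surfaces while these tasks are being pushed.
results = ray.get([task.remote(i) for i in range(1000)])

ray.shutdown()  # connection closes; the URI reference counts drop back to 0
```
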
One thing we noticed by cross-referencing logs and timestamps: when runtime_env_setup gets a cache hit, the GCS server's URI table is empty, and only after a slight delay do the packages show up in the table. What seems to happen is that a worker asks for the package in the window between those two events (this is all speculation).

Hope this is of some help in reproducing or solving this issue.
