inf000 opened this issue on Nov 26, 2024 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), triage (Needs triage (eg: priority, bug/not-bug, and owning component))
What happened + What you expected to happen
Greetings,
We have been experiencing issues with RayCluster and its autoscaler. We run RayCluster on a Kubernetes cluster, using spot instances for the workers and a persistent node for the head. The workers always scale out from 0 when new tasks arrive and scale back to 0 once everything is done. This has been working for us for some time, but after switching from Python 3.8.9 / Ray 2.9.3 to Python 3.11.10 / Ray 2.38.0 we started seeing a really problematic issue: at random, some pending tasks are left waiting to be scheduled even though no jobs are running. This causes the autoscaler to keep the workers up even though they are doing nothing; the workers shut down after the idle timeout, but the autoscaler provisions them again because of the stuck tasks.
Do you have any idea why this is happening and how to fix or work around it? It is impacting our stack significantly.
I'll attach some images below; as you can see, there are no actors and no unfinished jobs.
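Here is a rough sketch of how the same thing can be checked from the head node with Ray's state API (exact field and state names may vary between Ray versions):

```python
import ray
from ray.util.state import list_actors, list_jobs, list_tasks

# Attach to the already-running cluster from the head node.
ray.init(address="auto")

print(list_actors())   # empty: no actors exist
print(list_jobs())     # every submitted job has already finished

# Yet some tasks are still reported as unfinished, which keeps the autoscaler scaling workers back up.
for task in list_tasks():
    if task.state != "FINISHED":
        print(task.task_id, task.name, task.state)
```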
Versions / Dependencies
Python 3.11.10
Ray 2.38.0
Docker image rayproject/ray:2.38.0-py311
Reproduction script
We can't pinpoint when or how the issue happens; we suspect it occurs when a job fails at some point, either during its run or as it is being instantiated.
Issue Severity
High: It blocks me from completing my task.
inf000 added the bug and triage labels on Nov 26, 2024
I have encountered the same problem, but I still can't provide a reproducible script. It may be more likely to happen when a job is stopped immediately after submitting a large number of Ray tasks.
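If it helps, this is the kind of pattern I have in mind, just a sketch of the suspected trigger, not a confirmed reproduction (the task body and counts are placeholders):

```python
import ray

ray.init(address="auto")

@ray.remote
def noop(i):
    return i

# Submit a large burst of tasks...
refs = [noop.remote(i) for i in range(10_000)]

# ...and tear the job down almost immediately, before they can finish.
ray.shutdown()
```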
OK, a little potential progress on this: we now know how we ended up with those dangling tasks, and we discovered a possible issue between the runtime environment machinery and the GCS.
Our setup is as follows. We trigger a couple of jobs one after the other. Each job starts with ray.init, where we build a custom runtime environment by passing the pip dependencies and two local packages via py_modules. The packages are zipped and uploaded to the GCS, which stores them and adds them to the URI table and the URI reference table. The job runs everything, finishes, we shut down the connection, the reference count for the URIs drops to 0, and the tables are cleared.
At some random point (it could be the second, third, or n-th job) we call ray.init again, the packages are sent, and the logs show they are cached (e.g. "Runtime env py_modules gcs://_ray_pkg_53653ff9753e2d43.zip is already installed and will be reused"). The init finishes and we start pushing tasks with ray.remote, and at some point Ray throws an OSError (OSError: Failed to download runtime_env file package gcs://_ray_pkg_53653ff9753e2d43.zip from the GCS to the Ray worker node) inside the download_and_unpack_package function, specifically in the internal_kv_get call. At that point the tasks that were already scheduled are left dangling, because the job was never created for them, or was created but failed (we are not sure which).
This GCS failure is bizarre, because the jobs use the same cached packages and they are clearly not being deleted: most of the time we get cache hits and everything works fine across multiple subsequent runs. The packages are small (under 1 MB), and we even tried increasing the RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S env flag, although we know that is not the cause, since our setup takes a couple of seconds, far below the 600 s default.
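For reference, here is a minimal sketch of the per-job flow described above (the address, package paths, pip dependency, and task body are placeholders, not our real code):

```python
import ray

# Placeholder runtime environment: pip deps plus two local packages shipped via py_modules.
runtime_env = {
    "pip": ["pandas"],                      # placeholder pip dependency
    "py_modules": ["./pkg_a", "./pkg_b"],   # placeholder local packages, zipped and uploaded to the GCS
}

# Connect to the cluster and register the runtime env (placeholder address).
ray.init(address="ray://head-node:10001", runtime_env=runtime_env)

@ray.remote
def task(i):
    # Placeholder work; in our case the OSError is raised while tasks like this are being pushed.
    return i * i

results = ray.get([task.remote(i) for i in range(100)])

ray.shutdown()  # connection closed; URI reference counts should drop to 0 here
```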
One thing we noticed by cross-referencing logs and timestamps: when runtime_env_setup reports a cache hit, the GCS server's URI table is empty, and the packages only appear in the table after a slight delay. What seems to happen is that a worker asks for the package in that small window between the two events (this is all speculation).
Hope this is of some help in reproducing or solving the issue.