# No bid rejection for transient reasons (e.g. no available capacity) #3992
Labels: `comp/scheduler` (related to job scheduling components)
**wdbaruni** added a commit that referenced this issue on May 20, 2024:
This PR not only closes #3992, but also adds a node ranker based on available and queued capacity, which we didn't have before. I had to introduce the ranker along with this PR to keep certain tests passing that relied on bid rejection when resource limits were exceeded, specifically `TestParallelGPU`.
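For context, here is a minimal sketch of what such a capacity-based ranker could look like. This is illustrative only and not the actual Bacalhau ranker API; the `nodeCapacity` type, its fields, and the ranking rule are all assumptions:

```go
package main

import (
	"fmt"
	"sort"
)

// nodeCapacity is an illustrative stand-in for the capacity a compute
// node reports to the orchestrator: resources that are free right now,
// plus resources already claimed by its local queue.
type nodeCapacity struct {
	ID        string
	Available float64 // free CPU units
	Queued    float64 // CPU units consumed by locally queued jobs
}

// rank orders nodes so that nodes with enough free capacity for the job
// come first and, among busy nodes, those with the shortest local queue
// come first.
func rank(nodes []nodeCapacity, required float64) []nodeCapacity {
	ranked := make([]nodeCapacity, len(nodes))
	copy(ranked, nodes)
	sort.Slice(ranked, func(i, j int) bool {
		iFits := ranked[i].Available >= required
		jFits := ranked[j].Available >= required
		if iFits != jFits {
			return iFits // nodes that can run the job immediately win
		}
		return ranked[i].Queued < ranked[j].Queued // otherwise prefer the shorter queue
	})
	return ranked
}

func main() {
	nodes := []nodeCapacity{
		{ID: "node-a", Available: 0.2, Queued: 6},
		{ID: "node-b", Available: 3.2, Queued: 0},
		{ID: "node-c", Available: 0.2, Queued: 1},
	}
	// Prints node-b, node-c, node-a: the node with free capacity first,
	// then busy nodes ordered by queue depth.
	for _, n := range rank(nodes, 3) {
		fmt.Println(n.ID)
	}
}
```

The key point is that busy nodes are ranked lower rather than rejecting the bid outright, so the orchestrator can still fall back to them or retry later.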
**wdbaruni** added a commit that referenced this issue on Jun 3, 2024:
This PR introduces job queueing when no matching node is available in the network. This can happen because all nodes are currently busy processing other jobs, or because no node matches the job constraints, such as label selectors, engines, or publishers.

## QueueTimeout

By default, queueing is disabled and jobs fail immediately. Users can enable queueing and control how long a job may wait in the queue by setting `QueueTimeout` to a value greater than zero. There are two ways to set this value:

### Job Spec

Users can set this value in the job spec when calling `bacalhau job run spec.yaml`, such as:

```
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - sleep 90
    Timeouts:
      QueueTimeout: 3600
```

### Requester Node Configuration

Operators can set a default `QueueTimeout` in the Requester node's configuration so that all submitted jobs without a `QueueTimeout` are assigned the configured default. The configuration looks like:

```
Node:
  Requester:
    JobDefaults:
      QueueTimeout: 3600s
    Scheduler:
      QueueBackoff: 1m0s
```

## QueueBackoff

The way the requester node works is that it retries scheduling queued jobs every `QueueBackoff` window, which is also configured as shown above and defaults to 1 minute. A future improvement is to remove `QueueBackoff` and have the scheduler listen to node and cluster changes, re-queueing a job only when it believes the job can be rescheduled, instead of blindly retrying every `QueueBackoff`. A rough sketch of this retry behavior follows.
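This is a simplified illustration of the retry loop described above, not the actual scheduler code; `trySchedule` and the function names are assumptions for the sketch:

```go
package main

import (
	"fmt"
	"time"
)

// trySchedule is a placeholder for the scheduler's attempt to match a
// queued job to a node; here it is assumed to return false while no
// node matches.
func trySchedule(jobID string) bool { return false }

// requeueLoop retries a queued job every queueBackoff until it is
// scheduled or queueTimeout elapses, at which point the job fails.
func requeueLoop(jobID string, queueBackoff, queueTimeout time.Duration) {
	deadline := time.Now().Add(queueTimeout)
	for time.Now().Before(deadline) {
		if trySchedule(jobID) {
			fmt.Println(jobID, "scheduled")
			return
		}
		time.Sleep(queueBackoff) // blind periodic retry; see future improvements
	}
	fmt.Println(jobID, "failed: queue timeout exceeded")
}

func main() {
	// Short durations so the sketch runs quickly; the real defaults are
	// minutes (QueueBackoff) and whatever QueueTimeout the job sets.
	requeueLoop("job-1", 50*time.Millisecond, 200*time.Millisecond)
}
```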
## Testing

A [pre-release](https://github.com/bacalhau-project/bacalhau/releases/tag/v1.3.2-rc2) has been cut with this change along with #4051, and has been deployed to development. You can also use the examples below to test against development; just make sure you are using the same client version as the pre-release.

### Caveat

Compute nodes heartbeat their available resources every 30 seconds. If there is a spike in jobs submitted in a short period of time, the requester might oversubscribe a compute node, since it takes time before the requester knows the node is full. This won't fail the jobs, but the compute nodes will queue them locally instead of the requester. If new compute nodes join, the requester won't move jobs from the first compute node. This is related to moving away from rejecting jobs when the local queue is full, discussed [here](#3992). There are many ways to improve this, and I'll open a follow-up issue for it, but for now wait some time between job submissions to get more predictable tests.

### Sample Job

This is a sample job that takes 5 minutes to finish, is configured with queueing enabled for up to 1 hour, and requires 3 CPU units. There are two compute nodes in development with 3.2 CPU units each.

```
Name: A slow job
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - sleep 300
    Resources:
      CPU: "3"
    Timeouts:
      QueueTimeout: 3600
```

### Scenario 1: Busy resources

```
# both jobs should start immediately on different nodes
bacalhau job run --wait=false slow.yaml
bacalhau job run --wait=false slow.yaml

# validate the jobs are running on different nodes
bacalhau job describe <job_id>

# wait >30 seconds. Most likely the next job will be in pending state,
# but it can happen that the requester is not yet aware of available resources.
# wait and try again until the job state is pending
bacalhau job run --wait=false slow.yaml

# after >5 minutes, describe the pending job and it should move from pending to running
bacalhau job describe <job_id>
```

### Scenario 2: No available node

Run a job that asks for a node with the label `name=walid` (or any other name):

```
Name: A constrained job
Type: batch
Count: 1
Constraints:
  - operator: "="
    Key: "name"
    Values: ["walid"]
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - sleep 10
    Resources:
      CPU: "0.1"
    Timeouts:
      QueueTimeout: 3600
```

Run the job and describe it. It should be in pending state and not failed:

```
bacalhau job run --wait=false constrained.yaml
```

Join your machine as a compute node in a separate terminal, and give it the unique label, like `name=walid`:

```
export BACALHAU_DIR=$(mktemp -d)
bacalhau serve --node-type=compute --orchestrators=bootstrap.development.bacalhau.org:4222 --labels name=walid
```

Describe the job again, and it should be in running or completed state.

### Scenario 3: No queueing

Test the previous scenarios with no queue timeout defined, and the jobs should fail immediately.

## Future Improvements

1. Improve visibility of queued jobs, and why they are being queued (P0)
2. Add a `--queue-timeout` flag to `docker run` to allow queueing with imperative job submissions (P1)
3. Improve detection of available and local queue capacity of compute nodes to avoid oversubscribing nodes (P2)
4. Move away from `QueueBackoff` to listening to cluster state changes (not a priority)

---------

Co-authored-by: Forrest <[email protected]>
To simplify our scheduling logic and minimize edge cases and decision branches, I would like us to move to a state where bid rejections by compute nodes happen only for terminal reasons, not for transient reasons such as no available capacity. Compute nodes already queue jobs locally if they don't have immediate capacity. They can share both their available and queued capacity with the orchestrator, which can then rank nodes based on those capacities, filter out compute nodes with a large queue, and retry later. This also lets higher-priority jobs be scheduled on compute nodes with a large queue of pending jobs and still get the chance to run sooner, if we use priority queues in the `executor_buffer` (see the sketch below).
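To make the last point concrete, here is a minimal sketch of a priority queue such as the `executor_buffer` could use, built on Go's `container/heap`. The `bufferedJob` type and the priority semantics are assumptions for illustration, not the actual `executor_buffer` implementation:

```go
package main

import (
	"container/heap"
	"fmt"
)

// bufferedJob is an illustrative locally queued job with a priority.
type bufferedJob struct {
	ID       string
	Priority int // higher runs sooner
}

// jobQueue implements heap.Interface as a max-heap on Priority.
type jobQueue []bufferedJob

func (q jobQueue) Len() int           { return len(q) }
func (q jobQueue) Less(i, j int) bool { return q[i].Priority > q[j].Priority }
func (q jobQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *jobQueue) Push(x any)        { *q = append(*q, x.(bufferedJob)) }
func (q *jobQueue) Pop() any {
	old := *q
	n := len(old)
	job := old[n-1]
	*q = old[:n-1]
	return job
}

func main() {
	q := &jobQueue{}
	heap.Init(q)
	heap.Push(q, bufferedJob{ID: "batch-job", Priority: 1})
	heap.Push(q, bufferedJob{ID: "urgent-job", Priority: 10})

	// When capacity frees up, the highest-priority job runs first,
	// even though it was queued later.
	next := heap.Pop(q).(bufferedJob)
	fmt.Println(next.ID) // urgent-job
}
```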
Even if the orchestrator's view of the network's utilization is stale, we have solutions that can mitigate that, including: