
[Release] Release 0.7.1 #4438

Open · wants to merge 19 commits into base: releases/0.7.1_pure

Conversation

@zpoint (Collaborator) commented Dec 4, 2024

Based on releases/0.7.0, this cherry-picks all commits from 0.7.1, with some manual changes:

  • changes to the smoke tests (tests/test_smoke.py) only, to ensure more smoke tests pass and Buildkite works;
  • a version bump.

The release should include version 0.7.1 along with the manual changes:

  • This PR currently contains only the 0.7.1 updates based on version 0.7.0 and is open for review.
  • The manual changes have been submitted separately in this PR to make review easier.

The tests below were run against code that includes version 0.7.1 along with the manual changes; a sketch of the release flow follows.
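
For reference, a release branch like this is typically assembled roughly as follows. The sketch below uses placeholder commit hashes and assumes the version string lives in sky/__init__.py, so verify both before reusing it.

# Rough sketch of the release flow described above (placeholders, not the exact commands used)
git checkout -b releases/0.7.1 releases/0.7.0
# Cherry-pick the 0.7.1 commits (range endpoints are placeholders)
git cherry-pick <first-0.7.1-commit>^..<last-0.7.1-commit>
# Manual changes: adjust tests/test_smoke.py for Buildkite, then bump the version
# (assumption: the version string lives in sky/__init__.py; GNU sed shown, use `sed -i ''` on macOS)
sed -i "s/__version__ = '0.7.0'/__version__ = '0.7.1'/" sky/__init__.py
git commit -am "Bump version to 0.7.1"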


Smoke tests:

Use buildkite CI to run the following tests:

  • pytest tests/test_smoke.py --aws
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --azure
  • pytest tests/test_smoke.py --kubernetes

All tests pass except the following failures:

pytest tests/test_smoke.py::test_tpu_vm_pod --gcp ---- setup fail, env error, fixed by other PR on master
pytest tests/test_smoke.py::test_tpu_vm --gcp ---- setup fail, env error, fixed by other PR on master
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue, Permission denied by location policies.
pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- ssh fail on provision, even on master
pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_file_mounts --azure ---Failed to run command before rsync ?
pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- Resource limit, no h100
pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies

You can view the details by clicking the failure in Buildkite.

Manual tests:

  • locally build docs, open docs/build/index.html, and scroll through “CLI Reference” (ideally, every page) to check for missing sections (we once caught the CLI page completely missing due to an import error, and once it displayed odd blockquotes); a build sketch follows this checklist
  • Check sky -v
  • backward_compatibility_tests.sh run against 0.7.0 on aws, run by buildkite
  • Run manual stress tests (see subsection below)
    • run the following script:
      sky jobs launch --gpus A100:8 --cloud aws echo hi -y
      # Check we are properly failing over the zones:
      sky jobs logs --controller
      
    • run the following script (may fail due to resources being unavailable):
      sky launch -c dbg --cloud aws --num-nodes 16 --gpus T4 --down --use-spot 
      sky down dbg
      
    • sky launch --num-nodes=75 -c dbg --cpus 2+ --use-spot --down --cloud aws -y
    • many jobs
# Launching many jobs on a cluster
sky launch -c test-many-jobs --cloud aws --cpus 16 --region us-east-1
python3 -c "
import subprocess
from multiprocessing.pool import ThreadPool

def run_task(task):
    print(f'Running task {task}')
    subprocess.run(f'sky exec test-many-jobs -d \"echo hi {task}; sleep 60\"', shell=True)

pool = ThreadPool(8)
pool.map(run_task, range(1000))
"
# Test the job queue on cluster is correct
sky queue test-many-jobs
  • sky show-gpus manual tests

  • Run a 24-hour+ spot job and ensure it doesn’t OOM
    sky spot launch -n test-oom --cloud aws --cpus 2 sleep 1000000000000000
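
A minimal sketch of the “locally build docs” check from the first manual-test item above; the requirements file and build command are assumptions about the repo's Sphinx setup, not verified paths.

# Sketch only: file names and the build command are assumptions about the Sphinx setup
pip install -r docs/requirements-docs.txt   # assumption: docs dependencies live here
cd docs && make html                        # standard Sphinx build
# Open the generated index and click through "CLI Reference"; the path below follows the
# checklist above, though a default Sphinx Makefile may place it under build/html/ instead
open build/index.html                       # macOS; use xdg-open on Linux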

Michaelvll and others added 9 commits December 4, 2024 18:34
…ing (skypilot-org#4264)

* fix race condition for setting job status to FAILED during INIT

* Fix

* fix

* format

* Add smoke tests

* revert pending submit

* remove update entirely for the job schedule step

* wait for job 32 to finish

* fix smoke

* move and rename

* Add comment

* minor
* Avoid job schedule race condition

* format

* format

* Avoid race for cancel
…ounts are specified (skypilot-org#4317)

do file mounts if storage is specified
* avoid catching ValueError during failover

If the cloud api raises ValueError or a subclass of ValueError during instance
termination, we will assume the cluster was downed. Fix this by introducing a
new exception ClusterDoesNotExist that we can catch instead of the more general
ValueError.

* add unit test

* lint
@zpoint zpoint changed the title [Release] Release 0.7.0 [Release] Release 0.7.1 Dec 4, 2024
@zpoint zpoint changed the base branch from releases/0.7.1 to releases/0.7.1_pure December 4, 2024 10:51
cg505 and others added 3 commits December 9, 2024 10:58
…g#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses skypilot-org#4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
* smoke tests support storage mount only

* fix verify command

* rename to only_mount
@zpoint zpoint requested a review from Michaelvll December 10, 2024 04:30
@romilbhardwaj romilbhardwaj self-requested a review December 10, 2024 18:25
@@ -1144,7 +1144,7 @@ def test_gcp_stale_job_manual_restart():
     # Ensure the skylet updated the stale job status.
     _get_cmd_wait_until_job_status_contains_without_matching_job(
         cluster_name=name,
-        job_status=[JobStatus.FAILED.value],
+        job_status=[JobStatus.FAILED],

Collaborator:
For this kind of hot fix, we may want to include it in master and cherry-pick it?

Collaborator Author:

It's due to a merge conflict: on master the value is FAILED_DRIVER, which is correct there but does not exist in version 0.7.1.

@Michaelvll (Collaborator) commented Dec 19, 2024

Looking at the test failures (checked ones should be fine):

  • pytest tests/test_smoke.py::test_tpu_vm_pod --gcp ---- setup fail, env error, fixed by other PR on master
  • pytest tests/test_smoke.py::test_tpu_vm --gcp ---- setup fail, env error, fixed by other PR on master
  • pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --azure --- credential issue, Permission denied by location policies for me-central2
  • pytest tests/test_smoke.py::TestStorageWithCredentials::test_gcs_regions --aws --- Permission denied by location policies
  • pytest tests/test_smoke.py::test_gcp_force_enable_external_ips --gcp --- ssh fail on provision, even on master
    TODO: we should add a skip for this smoke test as this will only work running on a GCP instance.
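    Until that skip lands, the GCP-only test can be deselected when the suite runs off-GCP using pytest's standard -k filter (a convenience, not a fix):
    pytest tests/test_smoke.py --gcp -k "not test_gcp_force_enable_external_ips"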

The following do not fail on releases/0.7.0, so we should fix them:

  • pytest tests/test_smoke.py::test_file_mounts --azure ---Failed to run command before rsync ?
    Seems the GCP credential needs reauth on the agent? Should we switch to service account?
  • pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
    TODO: @cblmemo do you know what is the reason here?
  • pytest tests/test_smoke.py::test_kubernetes_context_failover --kubernetes --- Resource limit, no h100
    It should pass with the setup below. Can we try to set this up in buildkite?
    def test_kubernetes_context_failover():
        """Test if the kubernetes context failover works.

        This test requires two kubernetes clusters:
        - kind-skypilot: the local cluster with mock labels for 8 H100 GPUs.
        - another accessible cluster: with enough CPUs

        To start the first cluster, run:

        sky local up
        # Add mock label for accelerator
        kubectl label node --overwrite skypilot-control-plane skypilot.co/accelerator=h100 --context kind-skypilot
        # Get the token for the cluster in context kind-skypilot
        TOKEN=$(kubectl config view --minify --context kind-skypilot -o jsonpath='{.users[0].user.token}')
        # Get the API URL for the cluster in context kind-skypilot
        API_URL=$(kubectl config view --minify --context kind-skypilot -o jsonpath='{.clusters[0].cluster.server}')
        # Add mock capacity for GPU
        curl --header "Content-Type: application/json-patch+json" --header "Authorization: Bearer $TOKEN" --request PATCH --data '[{"op": "add", "path": "/status/capacity/nvidia.com~1gpu", "value": "8"}]' "$API_URL/api/v1/nodes/skypilot-control-plane/status"
        # Add a new namespace to test the handling of namespaces
        kubectl create namespace test-namespace --context kind-skypilot
        # Set the namespace to test-namespace
        kubectl config set-context kind-skypilot --namespace=test-namespace --context kind-skypilot
        """
  • pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
    TODO: try to change the region to eastus2 and fix it on master @zpoint
  • pytest tests/test_smoke.py::test_azure_best_tier_failover --azure --- ResourcesUnavailableError
    TODO: try to change the region to eastus2 and fix it on master @zpoint
  • pytest tests/test_smoke.py::test_azure_disk_tier --azure --- ResourcesUnavailableError
    TODO: try to change to eastus2 and fix it on master @zpoint
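
For the three Azure failures above, the proposed fix is essentially to pin the provisioned resources to eastus2. The CLI equivalent looks roughly like the lines below; the real change would adjust the region used inside tests/test_smoke.py.

# Sketch: verify Azure capacity in eastus2 with a throwaway cluster
sky launch -c dbg-eastus2 --cloud azure --region eastus2 echo hi -y
sky down dbg-eastus2 -y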

@cblmemo (Collaborator) commented Dec 19, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what is the reason here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes it doesn't.

Also, I'm a little confused - why is there an expected FAILED status?

@zpoint (Collaborator, Author) commented Dec 20, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what is the reason here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes it doesn't.


I've tried many times with no luck. The failure rate is high, even if it's flaky. Could we fix the flakiness?

@zpoint (Collaborator, Author) commented Dec 20, 2024

pytest tests/test_smoke.py::test_managed_jobs_storage --azure --- ? Fail to provision ?
TODO: try to change the region to eastus2 and fix it on master @zpoint

After changing the region, I found that this test case needs to run on an AWS controller. If we don't have a controller running, sky launches an Azure controller, which then fails due to missing AWS credentials. Is this a bug? @Michaelvll (a possible workaround is sketched after the traceback below)

(t-managed-jobs-storage-8b, pid=2429) Traceback (most recent call last):
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(t-managed-jobs-storage-8b, pid=2429)     return _run_code(code, main_globals, None,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/miniconda3/envs/skypilot-runtime/lib/python3.10/runpy.py", line 86, in _run_code
(t-managed-jobs-storage-8b, pid=2429)     exec(code, run_globals)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 583, in <module>
(t-managed-jobs-storage-8b, pid=2429)     start(args.job_id, args.dag_yaml, args.retry_until_up)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 541, in start
(t-managed-jobs-storage-8b, pid=2429)     _cleanup(job_id, dag_yaml=dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 480, in _cleanup
(t-managed-jobs-storage-8b, pid=2429)     dag, _ = _get_dag_and_name(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/jobs/controller.py", line 40, in _get_dag_and_name
(t-managed-jobs-storage-8b, pid=2429)     dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 101, in load_chain_dag_from_yaml
(t-managed-jobs-storage-8b, pid=2429)     task = task_lib.Task.from_yaml_config(task_config, env_overrides)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/task.py", line 438, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429)     storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1043, in from_yaml_config
(t-managed-jobs-storage-8b, pid=2429)     storage_obj = cls(name=name,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 556, in __init__
(t-managed-jobs-storage-8b, pid=2429)     self.add_store(StoreType.S3)
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 894, in add_store
(t-managed-jobs-storage-8b, pid=2429)     store = store_cls(
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1110, in __init__
(t-managed-jobs-storage-8b, pid=2429)     super().__init__(name, source, region, is_sky_managed,
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 261, in __init__
(t-managed-jobs-storage-8b, pid=2429)     self._validate()
(t-managed-jobs-storage-8b, pid=2429)   File "/home/azureuser/skypilot-runtime/lib/python3.10/site-packages/sky/data/storage.py", line 1156, in _validate
(t-managed-jobs-storage-8b, pid=2429)     raise exceptions.ResourcesUnavailableError(
(t-managed-jobs-storage-8b, pid=2429) sky.exceptions.ResourcesUnavailableError: Storage 'store: s3' specified, but AWS access is disabled. To fix, enable AWS by running `sky check`. More info: https://docs.skypilot.co/en/latest/getting-started/installation.html.
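
One way to avoid the mismatch while this gets investigated is to pin the managed-jobs controller to AWS in ~/.sky/config.yaml. A sketch, assuming the jobs.controller.resources key is supported in this release (double-check against the 0.7.1 config schema):

# Workaround sketch: pin the jobs controller to AWS so it has the credentials the S3 storage needs
# (assumes jobs.controller.resources exists in 0.7.1's config schema)
cat >> ~/.sky/config.yaml <<'EOF'
jobs:
  controller:
    resources:
      cloud: aws
EOF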

pytest tests/test_smoke.py::test_file_mounts --azure ---Failed to run command before rsync ?
Seems the GCP credential needs reauth on the agent? Should we switch to service account?

It's an AWS sync error, not GCP, and it reproduces 100% of the time. @Michaelvll (a quick credential check is sketched after the logs below)

E 12-20 16:05:02 subprocess_utils.py:141] Successfully installed PyYAML-6.0.2 awscli-1.36.26 botocore-1.35.85 colorama-0.4.6 docutils-0.16 jmespath-1.0.1 pyasn1-0.6.1 rsa-4.7.2 s3transfer-0.10.4
E 12-20 16:05:02 subprocess_utils.py:141] fatal error: Unable to locate credentials
E 12-20 16:05:02 subprocess_utils.py:141] 

Traceback (most recent call last):
  File "/Users/zepingguo/miniconda3/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 838, in invoke
    return super().invoke(ctx)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/zepingguo/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 1159, in launch
    _launch_with_confirm(task,
  File "/Users/zepingguo/Desktop/skypilot/sky/cli.py", line 628, in _launch_with_confirm
    sky.launch(
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/execution.py", line 529, in launch
    return _execute(
  File "/Users/zepingguo/Desktop/skypilot/sky/execution.py", line 329, in _execute
    backend.sync_file_mounts(handle, task.file_mounts,
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 386, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/common_utils.py", line 366, in _record
    return f(*args, **kwargs)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend.py", line 101, in sync_file_mounts
    return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3174, in _sync_file_mounts
    self._execute_file_mounts(handle, all_file_mounts)
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/cloud_vm_ray_backend.py", line 4634, in _execute_file_mounts
    backend_utils.parallel_data_transfer_to_nodes(
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend_utils.py", line 1440, in parallel_data_transfer_to_nodes
    subprocess_utils.run_in_parallel(_sync_node, runners, num_threads)
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/subprocess_utils.py", line 121, in run_in_parallel
    return list(p.imap(func, args))
  File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value
  File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/zepingguo/Desktop/skypilot/sky/backends/backend_utils.py", line 1418, in _sync_node
    subprocess_utils.handle_returncode(rc,
  File "/Users/zepingguo/Desktop/skypilot/sky/utils/subprocess_utils.py", line 148, in handle_returncode
    raise exceptions.CommandError(returncode, command, format_err_msg,
sky.exceptions.CommandError: Command mkdir -p ~/.sky/file_mounts/s3-data-test && aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 ~/.sky/file_mounts/s3-data-test failed with return code 1.
Failed to run command before rsync s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 -> /s3-data-test. Ensure that the network is stable, then retry. mkdir -p ~/.sky/file_mounts/s3-data-test && aws --version >/dev/null 2>&1 || pip3 install awscli && aws s3 sync --no-follow-symlinks s3://fah-public-data-covid19-cryptic-pockets/human/il6/PROJ14534/RUN999/CLONE0/results0 ~/.sky/file_mounts/s3-data-test See logs in ~/sky_logs/sky-2024-12-20-15-58-06-704254/file_mounts.log
D 12-20 16:05:02 skypilot_config.py:228] Using config path: /Users/zepingguo/.sky/config.yaml
D 12-20 16:05:02 skypilot_config.py:233] Config loaded:
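
A quick way to confirm this is a remote-credential problem is to check whether any AWS identity is visible from the provisioned node. The cluster name below is a placeholder; SkyPilot adds launched clusters to ~/.ssh/config, so plain ssh works.

# Check for AWS credentials on the remote node (cluster name is a placeholder)
ssh my-azure-cluster 'aws sts get-caller-identity || echo "no AWS credentials on the remote"'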

@zpoint zpoint mentioned this pull request Dec 20, 2024
@cblmemo (Collaborator) commented Dec 21, 2024

pytest tests/test_smoke.py::test_skyserve_new_autoscaler_update --azure --- Got FAILED_INITIAL_DELAY instead of FAILED
TODO: @cblmemo do you know what is the reason here?

Does this issue persist? Since Azure provisioning is relatively slow, it is possible that sometimes it passes the initial delay and sometimes it doesn't.

I've tried many times with no luck. The failure rate is high, even if it's flaky. Could we fix the flakiness?

Does increasing the initial delay work for you?

@zpoint (Collaborator, Author) commented Dec 24, 2024

I tried three times and it passed on the third attempt. Then I ran the test one more time and it passed again.

(I didn't modify any code.) The success rate is now acceptable. I think increasing the initial delay will work, but I'm still testing (a rough sketch of the bump is below the logs).

First attempt failure log: [screenshot]

Second attempt failure log: [screenshots]
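
If increasing the initial delay is the route taken, the change would look roughly like the lines below. The YAML keys follow SkyServe's readiness_probe schema, but the file, numbers, and run command are placeholders rather than the actual test fixture.

# Sketch only: a toy service with a larger initial delay to tolerate slow Azure provisioning
cat > /tmp/slow_azure_service.yaml <<'EOF'
service:
  readiness_probe:
    path: /
    initial_delay_seconds: 300   # raised to absorb slow provisioning
  replicas: 1
resources:
  cloud: azure
  ports: 8080
run: python -m http.server 8080
EOF
sky serve up -n delay-test /tmp/slow_azure_service.yaml -y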
