[SERVE] Allow adjustment of scaling policies without redeployment #4442

Open
JGSweets opened this issue Dec 5, 2024 · 6 comments

@JGSweets
Contributor

JGSweets commented Dec 5, 2024

Currently, when the replica_policy is altered, update runs a pseudo blue-green deployment in the sense that it launches all new resources.

Preferably, if only the replica_policy is changing, update would alter the policy itself without deploying or tearing down instances unless the new policy requires it.


Example 1:
Init: Currently, 2 resources are running, but the min_replica is set to 3.
Result: Only 1 instance is launched.

Example 2:
Init: Currently, 2 resources are running, but the min_replica is set to 1 and qps would not be met if scaled down.
Result: 1 instance is torn down.
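The two examples above can be sketched as a small calculation. This is only an illustration, not SkyServe's autoscaler: the function name `replica_delta`, the `load_target` input (the replica count the QPS-based logic would ask for), and the `max_replicas=4` value are all assumptions made for the sketch, and Example 2 is read here as "the load needs only 1 replica".

```python
def replica_delta(running: int, min_replicas: int, max_replicas: int,
                  load_target: int) -> int:
    """Return how many replicas to launch (positive) or tear down (negative).

    Hypothetical helper: `load_target` stands in for whatever replica count
    the load-based (QPS) logic would request.
    """
    # Clamp the load-based target into the policy's [min, max] range.
    target = max(min_replicas, min(max_replicas, load_target))
    # Only the difference from what is already running needs to change.
    return target - running


# Example 1: 2 running, min raised to 3 -> launch 1 instance.
print(replica_delta(running=2, min_replicas=3, max_replicas=4, load_target=2))  # 1
# Example 2: 2 running, min lowered to 1, load needs 1 -> tear down 1 instance.
print(replica_delta(running=2, min_replicas=1, max_replicas=4, load_target=1))  # -1
```

The point of the sketch is that existing replicas are kept and only the delta is applied, rather than relaunching everything.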


Solution Options:

  1. Have update check, using a hash, that only the replica_policy has changed.
  2. Use a flag with update that allows altering just the replica policy
  3. Have a separate update endpoint which allows updating the replica policy.
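A minimal sketch of option 1, assuming the task config is available as a plain dict with the scaling settings under a top-level `service` key. The helper names `non_service_hash` and `only_service_changed` are hypothetical and not part of SkyPilot.

```python
import hashlib
import json
from typing import Any, Dict


def non_service_hash(config: Dict[str, Any]) -> str:
    """Hash every top-level field except `service` (hypothetical helper)."""
    rest = {k: v for k, v in config.items() if k != 'service'}
    # sort_keys makes the serialization independent of key order.
    payload = json.dumps(rest, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


def only_service_changed(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """True if the configs differ, but only inside the `service` field."""
    return old != new and non_service_hash(old) == non_service_hash(new)


old = {
    'service': {'replica_policy': {'min_replicas': 2, 'max_replicas': 3}},
    'resources': {'cpus': 2},
    'run': 'python server.py',
}
new = dict(old, service={'replica_policy': {'min_replicas': 4,
                                            'max_replicas': 4}})
print(only_service_changed(old, new))  # True: only the policy differs
print(only_service_changed(old, dict(old, resources={'cpus': 4})))  # False
```

Storing the hash alongside each deployed version would let update compare two short strings instead of re-reading and diffing both YAML files.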

Version & Commit info:
skypilot, version 0.7.0
skypilot, commit 3f62588

@Michaelvll
Collaborator

cc'ing @cblmemo

@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024
@cblmemo
Collaborator

cblmemo commented Dec 21, 2024

Hi @JGSweets , thanks for reporting this! However, I think we already have this feature, see:

# Reuse all replicas that have the same config as the new version
# (except for the `service` field) by directly setting the version to be
# the latest version. This can significantly improve the speed
# for updating an existing service with only config changes to the
# service specs, e.g. scale down the service.
new_config = common_utils.read_yaml(os.path.expanduser(task_yaml_path))
# Always create new replicas and scale down old ones when file_mounts
# are not empty.
if new_config.get('file_mounts', None) != {}:
    return
for key in ['service']:
    new_config.pop(key)
replica_infos = serve_state.get_replica_infos(self._service_name)
for info in replica_infos:
    if info.version < version and not info.is_terminal:
        # Assume user does not change the yaml file on the controller.
        old_task_yaml_path = serve_utils.generate_task_yaml_file_name(
            self._service_name, info.version)
        old_config = common_utils.read_yaml(
            os.path.expanduser(old_task_yaml_path))
        for key in ['service']:
            old_config.pop(key)
        # Bump replica version if all fields except for service are
        # the same. File mounts should both be empty, as update always
        # create new buckets if they are not empty.
        if (old_config == new_config and
                old_config.get('file_mounts', None) == {}):
            logger.info(
                f'Updating replica {info.replica_id} to version '
                f'{version}. Replica {info.replica_id}\'s config '
                f'{old_config} is the same as '
                f'latest version\'s {new_config}.')
            info.version = version
            serve_state.add_or_update_replica(self._service_name,
                                              info.replica_id, info)
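The reuse condition in the snippet above can be tried in isolation with plain dicts. The following is a standalone sketch, not SkyPilot code: `can_reuse_replica` is a hypothetical name, and unlike the snippet it does not read YAML files or raise if the `service` key is missing.

```python
from typing import Any, Dict


def can_reuse_replica(old_config: Dict[str, Any],
                      new_config: Dict[str, Any]) -> bool:
    """Mirror the snippet's check: reuse only if nothing but `service` changed."""
    # Same guard as the snippet: anything other than an explicitly empty
    # `file_mounts` dict means new replicas are always created.
    if new_config.get('file_mounts', None) != {}:
        return False
    # Compare the configs with the `service` field left out.
    old = {k: v for k, v in old_config.items() if k != 'service'}
    new = {k: v for k, v in new_config.items() if k != 'service'}
    return old == new and old.get('file_mounts', None) == {}


base = {'file_mounts': {}, 'resources': {'cpus': 2},
        'service': {'replicas': 1}}
scaled = dict(base, service={'replicas': 2})
print(can_reuse_replica(base, scaled))  # True: only `service` differs

# With plain dicts that have no `file_mounts` key at all, `.get` returns None,
# which is not equal to {}, so the check as written reports "not reusable".
no_key_old = {'resources': {'cpus': 2}, 'service': {'replicas': 1}}
no_key_new = {'resources': {'cpus': 2}, 'service': {'replicas': 2}}
print(can_reuse_replica(no_key_old, no_key_new))  # False
```

Whether the task YAML stored on the controller always carries an explicit `file_mounts` entry is not something this sketch verifies. It only shows how the comparison behaves for a given pair of dicts.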

I also tried the following on current master (ee3cabd57247ff0f25cb65c0ee46bd35ead8d11a):

# now `service.replicas` field is 1
sky serve up -n minimal examples/serve/minimal.yaml
# change `service.replicas` field to 2
sky serve update minimal examples/serve/minimal.yaml

and got the following:

image

Notice that the replica with ID 1 has a version of 2, which means this replica's version was bumped and the replica was reused.

Could you share more about your usage? If that does not work for you, it is possible that there are some bugs in our system, and some related tests could help us find them ;)

@JGSweets
Contributor Author

Interesting, I had an experience recently where I increased min_replicas 2->4 and max_replicas 3->4, but it scaled up 4 new resources, like the blue-green deployment mentioned in the docs.

@JGSweets
Contributor Author

Is the intended functionality mentioned in the docs?

@cblmemo
Collaborator

cblmemo commented Dec 21, 2024

> Is the intended functionality mentioned in the docs?

Yes, pls check the first hint in this doc: https://docs.skypilot.co/en/latest/serving/update.html

@JGSweets
Contributor Author

I'll have to see if I can tease the issue out more / can replicate it. I thought using old resources was the intended functionality, so glad you confirmed it.

I'd need to verify that the AMI didn't change in addition to the min/max replicas. That is possibly what happened, but I thought it was the same.
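One way to check this kind of thing is to diff the old and new task configs and list every field that changed. The sketch below assumes both versions have already been loaded into dicts. `changed_fields` is a hypothetical helper and the AMI IDs are placeholders.

```python
from typing import Any, Dict, List

_MISSING = object()  # Distinguishes "key absent" from "value is None".


def changed_fields(old: Dict[str, Any], new: Dict[str, Any],
                   prefix: str = '') -> List[str]:
    """Return dotted paths of all fields that differ between two configs."""
    paths: List[str] = []
    for key in sorted(set(old) | set(new)):
        path = f'{prefix}{key}'
        a, b = old.get(key, _MISSING), new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse so that nested fields are reported individually.
            paths.extend(changed_fields(a, b, prefix=f'{path}.'))
        elif a != b:
            paths.append(path)
    return paths


old_cfg = {
    'resources': {'image_id': 'ami-aaa', 'cpus': 2},
    'service': {'replica_policy': {'min_replicas': 2, 'max_replicas': 3}},
}
new_cfg = {
    'resources': {'image_id': 'ami-bbb', 'cpus': 2},
    'service': {'replica_policy': {'min_replicas': 4, 'max_replicas': 4}},
}
for field in changed_fields(old_cfg, new_cfg):
    print(field)
```

Any path printed that does not start with `service.` (here `resources.image_id`) would explain why new replicas were launched instead of the existing ones being reused.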
