run auto deploy remote model in partially deployed status #3423

Zhangxunmt · 2025-01-23T01:26:08Z

Description

Currently, remote model auto-deploy only happens when the model is not deployed at all, which is detected by checking that the number of running worker nodes is 0. In some edge cases, however, we'd like to auto-deploy the model even if it is in PARTIALLY_DEPLOYED status.
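
For reviewers, the difference between the two conditions described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `AutoDeployCheck`, `shouldAutoDeployOld`, and `shouldAutoDeploy` are hypothetical names, and the sketch assumes a model counts as partially deployed when at least one target worker node is not among the running worker nodes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of ml-commons.
final class AutoDeployCheck {

    // Existing behavior as described: auto-deploy only when no worker node is running.
    static boolean shouldAutoDeployOld(String[] runningNodes) {
        return runningNodes == null || runningNodes.length == 0;
    }

    // Sketch of the proposed behavior: also auto-deploy when the model is
    // only partially deployed, i.e. some target node is not running it.
    static boolean shouldAutoDeploy(String[] runningNodes, String[] targetNodes) {
        if (runningNodes == null || runningNodes.length == 0) {
            return true; // not deployed at all
        }
        if (targetNodes == null || targetNodes.length == 0) {
            // Assumption: with no known targets there is nothing to compare against.
            return false;
        }
        Set<String> running = new HashSet<>(Arrays.asList(runningNodes));
        for (String target : targetNodes) {
            if (!running.contains(target)) {
                return true; // partially deployed
            }
        }
        return false; // fully deployed
    }
}
```

With running nodes `[node1]` and target nodes `[node1, node2]`, the old check skips auto-deploy while the sketched check triggers it.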

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Xun Zhang <[email protected]>

pyek-bot · 2025-01-23T02:38:49Z

plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java

@@ -2460,6 +2460,10 @@ public int getWorkerNodesSize(String modelId, FunctionName functionName) {
        return getWorkerNodes(modelId, functionName, false).length;
    }

+    public String[] getTargetWorkerNodes(String modelId) {
+        return modelCacheHelper.getTargetWorkerNodes(modelId);


What do you think about returning an empty string array here instead of null? Returning null may cause an NPE.
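
A minimal sketch of the null-safe variant suggested here, assuming the lookup is backed by a per-model cache. `TargetNodeLookup` and its `HashMap` are hypothetical stand-ins for the real cache helper, whose internals are not shown in this diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the model cache; not the ml-commons class.
final class TargetNodeLookup {

    private static final String[] NO_NODES = new String[0];

    private final Map<String, String[]> targetNodesByModel = new HashMap<>();

    void setTargetWorkerNodes(String modelId, String[] nodes) {
        targetNodesByModel.put(modelId, nodes.clone());
    }

    // Returns the target worker nodes for the model, or an empty array
    // (never null) when the model is not in the cache.
    String[] getTargetWorkerNodes(String modelId) {
        String[] nodes = targetNodesByModel.get(modelId);
        return nodes == null ? NO_NODES : nodes.clone();
    }
}
```

Callers can then read `.length` or iterate over the result without a null check.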

ylwu-amzn · 2025-01-23T22:16:54Z

plugin/src/main/java/org/opensearch/ml/model/MLModelCacheHelper.java

+        if (modelCache == null) {
+            return null;
+        }
+        return modelCache.getTargetWorkerNodes();


We should consider the deploy-to-all-nodes case.
If deploy to all nodes is true, the target worker nodes may be [node1, node2] on day 1.
Then on day 2, the user adds one more node to the cluster. Now the target worker nodes should be [node1, node2, node3], so that we can deploy to all nodes.

I think the worker nodes from the cache will still be [node1, node2], right? I can't remember the details; maybe it returns null or empty for the deploy-to-all-nodes case? Can you confirm?

From what I see, the sync-up job only syncs the worker nodes, not the target worker nodes. This is another enhancement that is needed: both the target worker nodes and the running worker nodes need to be up to date in memory to cover all of these cases.
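
One way to picture the day 1 / day 2 scenario discussed in this thread, as a sketch under assumptions rather than a proposal for the actual fix: `TargetNodeResolver`, `missingNodes`, and the `deployToAllNodes` flag are hypothetical names, and the sketch assumes the current cluster node list is available at check time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of ml-commons.
final class TargetNodeResolver {

    // Returns the nodes that should run the model but currently do not.
    // When deployToAllNodes is true, the cached target list may be stale,
    // so the current cluster nodes are used as the effective targets.
    static List<String> missingNodes(
        boolean deployToAllNodes,
        List<String> cachedTargetNodes,
        List<String> currentClusterNodes,
        List<String> runningNodes
    ) {
        List<String> effectiveTargets = deployToAllNodes ? currentClusterNodes : cachedTargetNodes;
        Set<String> running = new HashSet<>(runningNodes);
        List<String> missing = new ArrayList<>();
        for (String node : effectiveTargets) {
            if (!running.contains(node)) {
                missing.add(node);
            }
        }
        return missing;
    }
}
```

On day 2, with cached targets `[node1, node2]`, cluster nodes `[node1, node2, node3]`, and running nodes `[node1, node2]`, this returns `[node3]` when deploy to all nodes is true and an empty list when it is false.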

dhrubo-os · 2025-01-29T02:08:04Z

plugin/src/main/java/org/opensearch/ml/model/MLModelManager.java

@@ -2460,6 +2460,10 @@ public int getWorkerNodesSize(String modelId, FunctionName functionName) {
        return getWorkerNodes(modelId, functionName, false).length;
    }

+    public String[] getTargetWorkerNodes(String modelId) {


Can we add Javadoc to the public method?

dhrubo-os · 2025-01-29T02:09:39Z

Can we add tests?

run auto deploy remote model in partially deployed status

0671963

Signed-off-by: Xun Zhang <[email protected]>

Zhangxunmt requested review from b4sjoo, dhrubo-os, mingshl, jngz-es, model-collapse, rbhavna, ylwu-amzn, zane-neo, austintlee, HenryL27 and xinyual as code owners January 23, 2025 01:26

Zhangxunmt had a problem deploying to ml-commons-cicd-env January 23, 2025 01:27 — with GitHub Actions Failure

pyek-bot reviewed Jan 23, 2025


ylwu-amzn reviewed Jan 23, 2025


dhrubo-os reviewed Jan 29, 2025

