Commit b86c52b

Proposal for Karpenter integration
1 parent 5e00ed3 commit b86c52b

2 files changed

+325
-0
lines changed

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
1+
# Proposal-106: Support scaling with Spot instances for cost saving with Karpenter
2+
3+
<!--
4+
This is the title of your Proposal. Keep it short, simple, and descriptive. A good
5+
title can help communicate what the Proposal is and should be considered as part of
6+
any review.
7+
-->
8+
9+
<!--
10+
A table of contents is helpful for quickly jumping to sections of a Proposal and for
11+
highlighting any additional information provided beyond the standard Proposal
12+
template.
13+
14+
Ensure the TOC is wrapped with
15+
<code>&lt;!-- toc --&gt;&lt;!-- /toc --&gt;</code>
16+
tags, and then generate with `hack/update-toc.sh`.
17+
-->
18+
19+
<!-- toc -->
20+
- [Proposal-106: Support scaling with Spot instances for cost saving with Karpenter](#proposal-106-support-scaling-with-spot-instances-for-cost-saving-with-karpenter)
21+
- [Summary](#summary)
22+
- [Motivation](#motivation)
23+
- [Goals](#goals)
24+
- [Non-Goals](#non-goals)
25+
- [Proposal](#proposal)
26+
- [User Stories (Optional)](#user-stories-optional)
27+
- [Story 1: ML Engineer – Cost-Efficient Deployment of LLMs](#story-1-ml-engineer--cost-efficient-deployment-of-llms)
28+
- [Story 2: Workload Author – Flexible and Preferred GPU Scheduling with Spot Support](#story-2-workload-author--flexible-and-preferred-gpu-scheduling-with-spot-support)
29+
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
30+
- [Risks and Mitigations](#risks-and-mitigations)
31+
- [Design Details](#design-details)
32+
- [Test Plan](#test-plan)
33+
- [Prerequisite testing updates](#prerequisite-testing-updates)
34+
- [Unit tests](#unit-tests)
35+
- [Integration tests](#integration-tests)
36+
- [e2e tests](#e2e-tests)
37+
- [Graduation Criteria](#graduation-criteria)
38+
- [Implementation History](#implementation-history)
39+
- [Drawbacks](#drawbacks)
40+
- [Alternatives](#alternatives)
41+
<!-- /toc -->
42+
43+
## Summary
44+
45+
<!--
46+
This section is incredibly important for producing high-quality, user-focused
47+
documentation such as release notes or a development roadmap. It should be
48+
possible to collect this information before implementation begins, in order to
49+
avoid requiring implementors to split their attention between writing release
50+
notes and implementing the feature itself. Proposal editors and SIG Docs
51+
should help to ensure that the tone and content of the `Summary` section is
52+
useful for a wide audience.
53+
54+
A good summary is probably at least a paragraph in length.
55+
56+
Both in this section and below, follow the guidelines of the [documentation
57+
style guide]. In particular, wrap lines to a reasonable length, to make it
58+
easier for reviewers to cite specific portions, and to minimize diff churn on
59+
updates.
60+
61+
-->
62+
63+
64+
## Motivation
65+
66+
<!--
67+
This section is for explicitly listing the motivation, goals, and non-goals of
68+
this Proposal. Describe why the change is important and the benefits to users. The
69+
motivation section can optionally provide links to [experience reports] to
70+
demonstrate the interest in a Proposal within the wider InftyAI community.
71+
72+
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
73+
-->
74+
75+
76+
### Goals
77+
78+
<!--
79+
List the specific goals of the Proposal. What is it trying to achieve? How will we
80+
know that this has succeeded?
81+
-->
82+
83+
- Provision Spot instances for inference workloads based on the model's flavor requirements.
84+
- Support flexible and preferred GPU scheduling with Spot instances.
85+
- Target AWS only in this proposal, while keeping the implementation extensible to other cloud providers.
86+
87+
### Non-Goals
88+
89+
<!--
90+
What is out of scope for this Proposal? Listing non-goals helps to focus discussion
91+
and make progress.
92+
-->
93+
94+
- Integration with the [Kubernetes Cluster Autoscaler](https://github.com/kubernetes/autoscaler) is out of scope for this proposal.
95+
- Adding custom scheduler support to upstream Karpenter is tracked in [this issue](https://github.com/kubernetes-sigs/karpenter/issues/742) and is out of scope here. Once upstream supports it, we will no longer need to maintain the [forked version](https://github.com/InftyAI/karpenter).
96+
97+
## Proposal
98+
99+
<!--
100+
This is where we get down to the specifics of what the proposal actually is.
101+
This should have enough detail that reviewers can understand exactly what
102+
you're proposing, but should not include things like API designs or
103+
implementation. What is the desired outcome and how do we measure success?
104+
The "Design Details" section below is for the real
105+
nitty-gritty.
106+
-->
107+
108+
109+
### User Stories (Optional)
110+
111+
<!--
112+
Detail the things that people will be able to do if this Proposal is implemented.
113+
Include as much detail as possible so that people can understand the "how" of
114+
the system. The goal here is to make this feel real for users without getting
115+
bogged down.
116+
-->
117+
118+
#### Story 1: ML Engineer – Cost-Efficient Deployment of LLMs
119+
120+
As a machine learning engineer deploying large language models (LLMs), I don't own any physical GPU servers, so I have to rent them from cloud providers. I want to automatically use cheaper GPU Spot instances for serving models when available, so that I can reduce infrastructure costs without sacrificing performance.
121+
122+
#### Story 2: Workload Author – Flexible and Preferred GPU Scheduling with Spot Support
123+
124+
As a workload author, I want to publish Kubernetes manifests for my model-serving workloads that are broadly compatible across different device types, without being overly prescriptive or requiring end users to modify them.
125+
126+
- My workloads are GPU-dependent, but there are many different GPU models available in the cloud (e.g., A10, A100, H100).
127+
- Instead of locking my manifests to a single GPU type, I want to express a preference-ordered list of compatible GPU types (e.g., prefer A100, fall back to A10 or L4).
128+
- This gives end users the flexibility to run the same manifest on different underlying infrastructure.
129+
- If none of the existing nodes in the cluster meet the constraints (e.g., no compatible GPUs available), I want the system to automatically provision an appropriate Spot instance from the cloud provider, based on my declared GPU preferences and resource requirements.
130+
131+
This approach allows me to build and share portable workloads that are cost-aware, device-flexible, and production-safe, without the need for users to rewrite manifests or manage instance-level complexity themselves.
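
The decision flow in this story can be sketched as follows. This is only an illustrative sketch in Python, not the proposed implementation: the `Node` type, the `pick_gpu_type` and `plan` helpers, and the `"schedule"` / `"provision-spot"` outcomes are hypothetical names introduced here, and it assumes the workload supplies a preference-ordered list of GPU types.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class Node:
    """Hypothetical view of an existing cluster node."""
    name: str
    gpu_type: str
    free_gpus: int


def pick_gpu_type(
    preferred: Sequence[str], nodes: Sequence[Node], gpus_needed: int = 1
) -> Optional[str]:
    """Return the most-preferred GPU type that an existing node can serve, else None."""
    for gpu_type in preferred:  # earlier entries are more preferred
        if any(n.gpu_type == gpu_type and n.free_gpus >= gpus_needed for n in nodes):
            return gpu_type
    return None


def plan(
    preferred: Sequence[str], nodes: Sequence[Node], gpus_needed: int = 1
) -> Tuple[str, Union[str, List[str]]]:
    """Choose between using an existing node and requesting a new Spot instance."""
    gpu_type = pick_gpu_type(preferred, nodes, gpus_needed)
    if gpu_type is not None:
        return ("schedule", gpu_type)
    # No existing node fits: request a Spot instance and pass the whole
    # preference list along so the provisioner can honor the declared order.
    return ("provision-spot", list(preferred))
```

For example, with preferences `["A100", "A10", "L4"]` and a single node offering one free A10, `plan` selects the existing A10 node; with no matching nodes it returns a provisioning request that carries the full preference list.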
132+
133+
### Notes/Constraints/Caveats (Optional)
134+
135+
<!--
136+
What are the caveats to the proposal?
137+
What are some important details that didn't come across above?
138+
Go into as much detail as necessary here.
139+
This might be a good place to talk about core concepts and how they relate.
140+
-->
141+
142+
### Risks and Mitigations
143+
144+
<!--
145+
What are the risks of this proposal, and how do we mitigate? Think broadly.
146+
For example, consider both security and how this will impact the larger
147+
InftyAI ecosystem.
148+
149+
How will security be reviewed, and by whom?
150+
151+
How will UX be reviewed, and by whom?
152+
153+
Consider including folks who also work outside the SIG or subproject.
154+
-->
155+
156+
## Design Details
157+
158+
<!--
159+
This section should contain enough information that the specifics of your
160+
change are understandable. This may include API specs (though not always
161+
required) or even code snippets. If there's any ambiguity about HOW your
162+
proposal will be implemented, this is the place to discuss them.
163+
-->
164+
165+
166+
### Test Plan
167+
168+
<!--
169+
**Note:** *Not required until targeted at a release.*
170+
The goal is to ensure that we don't accept enhancements with inadequate testing.
171+
172+
All code is expected to have adequate tests (eventually with coverage
173+
expectations).
174+
175+
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
176+
-->
177+
178+
[x] I/we understand the owners of the involved components may require updates to
179+
existing tests to make this code solid enough prior to committing the changes necessary
180+
to implement this enhancement.
181+
182+
##### Prerequisite testing updates
183+
184+
<!--
185+
Based on reviewers' feedback, describe what additional tests need to be added prior to
186+
implementing this enhancement to ensure the enhancement also has solid foundations.
187+
-->
188+
189+
##### Unit tests
190+
191+
<!--
192+
In principle, all added code should have complete unit test coverage, so providing
193+
the exact set of tests will not bring additional value.
194+
However, if complete unit test coverage is not possible, explain the reason of it
195+
together with explanation why this is acceptable.
196+
-->
197+
198+
<!--
199+
Additionally, for Alpha try to enumerate the core package you will be touching
200+
to implement this enhancement and provide the current unit coverage for those
201+
in the form of:
202+
- <package>: <date> - <current test coverage>
203+
204+
This can inform certain test coverage improvements that we want to do before
205+
extending the production code to implement this enhancement.
206+
-->
207+
208+
Forked Karpenter:
209+
210+
- `pkg/controllers/provisioning`: the `Model Inference Requirements` tests check whether the model inference requirements are met when a node is provisioned. They will be added to the existing suite tests.
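
To make the kind of check described above concrete, here is a minimal sketch in Python. It is not the forked Karpenter code: `InferenceRequirements`, `InstanceOffering`, `meets_requirements`, and all of their fields are hypothetical, and it assumes the requirements reduce to a set of acceptable GPU types, a minimum GPU count, and a capacity type.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class InferenceRequirements:
    """Hypothetical requirements derived from a model's flavors."""
    gpu_types: FrozenSet[str]
    gpu_count: int


@dataclass(frozen=True)
class InstanceOffering:
    """Hypothetical description of an instance the provisioner could launch."""
    name: str
    gpu_type: str
    gpu_count: int
    capacity_type: str  # "spot" or "on-demand"


def meets_requirements(
    offering: InstanceOffering,
    req: InferenceRequirements,
    capacity_type: str = "spot",
) -> bool:
    """Return True only if the offering satisfies every requirement."""
    return (
        offering.capacity_type == capacity_type
        and offering.gpu_type in req.gpu_types
        and offering.gpu_count >= req.gpu_count
    )
```

A unit test along these lines would cover one case per condition: an offering that satisfies everything, one with too few GPUs, one with an unlisted GPU type, and one with the wrong capacity type.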
211+
212+
##### Integration tests
213+
214+
<!--
215+
Integration tests allow control of the configuration parameters used to start the binaries under test.
216+
This is different from e2e tests which do not allow configuration of parameters.
217+
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
218+
-->
219+
220+
<!--
221+
This question should be filled when targeting a release.
222+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
223+
224+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
225+
https://storage.googleapis.com/k8s-triage/index.html
226+
-->
227+
228+
N/A.
229+
230+
##### e2e tests
231+
232+
<!--
233+
This question should be filled when targeting a release.
234+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
235+
236+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
237+
https://storage.googleapis.com/k8s-triage/index.html
238+
239+
We expect no non-infra related flakes in the last month as a GA graduation criterion.
240+
-->
241+
242+
- Add one e2e test to verify that the whole system can be launched via the Helm chart. By leveraging the kwok provider from the Karpenter repo, we can test the whole system with Spot instances without real cloud resources.
243+
- Manually test on EKS with real Spot instances, using a custom image built from the forked Karpenter.
244+
245+
### Graduation Criteria
246+
247+
<!--
248+
249+
Clearly define what it means for the feature to be implemented and
250+
considered stable.
251+
252+
If the feature you are introducing has high complexity, consider adding graduation
253+
milestones with these graduation criteria:
254+
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
255+
- [Feature gate][feature gate] lifecycle
256+
- [Deprecation policy][deprecation-policy]
257+
258+
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
259+
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
260+
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
261+
-->
262+
263+
## Implementation History
264+
265+
<!--
266+
Major milestones in the lifecycle of a Proposal should be tracked in this section.
267+
Major milestones might include:
268+
- the `Summary` and `Motivation` sections being merged, signaling SIG acceptance
269+
- the `Proposal` section being merged, signaling agreement on a proposed design
270+
- the date implementation started
271+
- the first llmaz release where an initial version of the Proposal was available
272+
- the version of llmaz where the Proposal graduated to general availability
273+
- when the Proposal was retired or superseded
274+
-->
275+
276+
- 2025-06-04: Proposal drafted.
277+
278+
## Drawbacks
279+
280+
<!--
281+
Why should this Proposal _not_ be implemented?
282+
-->
283+
284+
## Alternatives
285+
286+
<!--
287+
What other approaches did you consider, and why did you rule them out? These do
288+
not need to be as detailed as the proposal, but should include enough
289+
information to express the idea and why it was not acceptable.
290+
-->
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
title: Support scaling with Spot instances for cost saving
2+
proposal-number: 106
3+
authors:
4+
- carlory
5+
status: provisional
6+
creation-date: 2025-06-04
7+
reviewers:
8+
- kerthcet
9+
approvers:
10+
- kerthcet
11+
12+
see-also: []
13+
replaces: []
14+
15+
# The target maturity stage in the current dev cycle for this proposal.
16+
stage: alpha
17+
18+
# The most recent milestone for which work toward delivery of this proposal has been
19+
# done. This can be the current (upcoming) milestone, if it is being actively
20+
# worked on.
21+
latest-milestone: "v0.2"
22+
23+
# The milestone at which this feature was, or is targeted to be, at each stage.
24+
milestone:
25+
alpha: "v0.2"
26+
beta: TBD
27+
stable: TBD
28+
29+
# The following PRR answers are required at alpha release
30+
# List the feature gate name and the components for which it must be enabled
31+
feature-gates: []
32+
disable-supported: true
33+
34+
# The following PRR answers are required at beta release
35+
metrics: []
