
eks-prow-build-cluster: Reconsider instance type selection #5066

Open
Tracked by #5167
tzneal opened this issue Mar 30, 2023 · 18 comments

Labels
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

tzneal commented Mar 30, 2023

What should be cleaned up or changed:

Some changes were made to the EKS cluster to attempt to resolve an issue with test flakes. These changes also increased the per-node cost. We should consider reverting these changes to reduce cost.

a) Changing to an instance type without instance storage.

b) Changing back to an AMD CPU type.

c) Changing to a roughly 8 CPU / 64 GB type to more closely match the existing GCP cluster nodes.

The cluster currently uses an r5d.4xlarge (16 CPU / 128 GB) with an on-demand cost of $1.152 per hour.

An r5a.4xlarge (16 CPU / 128 GB) has an on-demand cost of $0.904 per hour.

An r5a.2xlarge (8 CPU / 64 GB) has an on-demand cost of $0.45 per hour.
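
For a rough sense of scale, per node, assuming ~730 on-demand hours per month (this ignores spot pricing, autoscaling, and the fact that a fleet of 2xlarge nodes would need roughly twice as many instances):

  • r5d.4xlarge → r5a.4xlarge: (1.152 - 0.904) × 730 ≈ $181 saved per node per month
  • r5d.4xlarge → r5a.2xlarge: (1.152 - 0.45) × 730 ≈ $512 saved per node per month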

Provide any links for context:

@tzneal tzneal added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 30, 2023
tzneal commented Mar 30, 2023

/sig k8s-infra

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 30, 2023
@BenTheElder BenTheElder added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Mar 31, 2023
xmudrii commented Apr 2, 2023

I'm going to transfer this issue to k/k8s.io as other issues related to this cluster are already there.
/transfer-issue k8s.io

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/test-infra Apr 2, 2023
xmudrii commented Apr 2, 2023

/assign @xmudrii @pkprzekwas

@BenTheElder
Member

One thing to consider: because Kubernetes doesn't have I/O or IOPS isolation, sizing nodes really large changes the CPU : I/O ratio (though this won't be 1:1 between GCP and AWS anyhow). Really large nodes allow either high-core-count jobs or bin-packing more jobs per node, but the latter can cause issues by over-packing nodes relative to their I/O throughput.

This is less of an issue today than when we ran bazel builds widely, but it's still something that can cause performance issues. The existing size is semi-arbitrary and may be somewhat GCP-specific, but right now tests that are likely to be I/O heavy sometimes effectively reserve that I/O by reserving ~all of the CPU at our current node sizes.

xmudrii commented Apr 2, 2023

xref #4686

xmudrii commented Apr 2, 2023

To add to what @BenTheElder said: we already had issues with GOMAXPROCS for unit tests. We've "migrated" 5 jobs so far and one was affected (potentially one more). To avoid such issues, we might want instances close to what we have on GCP. We can't have a 1:1 mapping, but we can try using similar instances based on what AWS offers.

Not having to deal with things such as GOMAXPROCS will make the migration smoother and save us a lot of time debugging such issues.
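
One common mitigation for this class of problem (shown here only as a sketch; it isn't necessarily what the jobs or the linked fix use) is to cap GOMAXPROCS to the container's cgroup CPU quota rather than the node's core count, e.g. with go.uber.org/automaxprocs:

  package main

  // Minimal sketch: importing automaxprocs for its side effect sets
  // GOMAXPROCS from the container's CPU limit at startup, so test
  // parallelism matches what the pod can actually use.
  import (
      "fmt"
      "runtime"

      _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the cgroup CPU quota
  )

  func main() {
      // On a 16-vCPU node with a 4-CPU container limit this prints 4, not 16.
      fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
  }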

dims commented Apr 2, 2023

xmudrii commented Apr 2, 2023

@dims Thanks for driving this forward. Just to note, though: this fixes it only for k/k; other subprojects might be affected and would need to apply a similar patch.

@BenTheElder
Member

Go is expected to solve GOMAXPROCS upstream (detecting this in the stdlib has been accepted), and GOMAXPROCS can also be set in CI in the meantime. As-is, jobs already have this wrong and we should resolve that independently of selecting node size.

tzneal commented Apr 3, 2023

> As-is, jobs already have this wrong and we should resolve that independently of selecting node size.

+1 for setting this on existing jobs. I have a secret hope that it might generally reduce flakiness a bit.
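
As a quick diagnostic (hypothetical, not something the jobs run today), a job could log what it is actually getting; if the GOMAXPROCS env var is unset, the Go runtime defaults to the node's logical CPU count rather than the container's CPU limit:

  package main

  import (
      "fmt"
      "os"
      "runtime"
  )

  func main() {
      // GOMAXPROCS env unset => runtime.GOMAXPROCS(0) == runtime.NumCPU(),
      // i.e. the node's core count, which overstates a smaller pod CPU limit.
      fmt.Printf("GOMAXPROCS env=%q runtime=%d NumCPU=%d\n",
          os.Getenv("GOMAXPROCS"), runtime.GOMAXPROCS(0), runtime.NumCPU())
  }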

@TerryHowe
Member

Maybe try some bare metal node like an m5.2xlarge or m6g.2xlarge?

xmudrii commented Apr 3, 2023

@TerryHowe We need to use memory optimized instances because our jobs tend to use a lot of memory.

xmudrii commented Apr 3, 2023

Update: we decided to go with a three-step phased approach:

  • Switch from r5d.4xlarge to r6id.2xlarge (this instance size should be very close to what we have on GCP)
  • Switch from r6id.2xlarge to r6i.2xlarge (i.e. switch from SSDs to EBS)
  • Switch from r6i.2xlarge to r6a.2xlarge (i.e. switch to AMD CPUs)

Note: the order of phases might get changed.

Each phase should last at least 24 hours to ensure that tests are stable. I just started the first phase and I think we should leave it on until Wednesday morning CEST.

xmudrii commented Apr 3, 2023

Update: we tried r6id.2xlarge but it seems that 8 vCPUs are not enough:

  Type     Reason             Age   From                Message
  ----     ------             ----  ----                -------
  Warning  FailedScheduling   44s   default-scheduler   0/20 nodes are available: 20 Insufficient cpu. preemption: 0/20 nodes are available: 20 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  38s   cluster-autoscaler  pod didn't trigger scale-up: 1 Insufficient cpu

I'm trying r5ad.4xlarge instead.

xmudrii commented Apr 25, 2023

/retitle eks-prow-build-cluster: Reconsider instance type selection

@k8s-ci-robot k8s-ci-robot changed the title Consider instance type selection on the EKS Cluster eks-prow-build-cluster: Reconsider instance type selection Apr 25, 2023
ameukam commented Nov 15, 2023

@xmudrii are we still doing this? Do we want to use an instance type with fewer resources?

xmudrii commented Nov 15, 2023

@ameukam I would still like to look into this, but we'd most likely need to adopt Karpenter to be able to do it (#5168).
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Nov 15, 2023
xmudrii commented Feb 12, 2024

Blocked by #5168
/unassign @xmudrii @pkprzekwas
