A dedicated_cores pod does not have an exclusive CPU #608
Comments
@flpanbin Which mode is this running in (QRM or ORM)? Also, could you provide the node's NUMA information?

It is running in QRM mode. The node's NUMA information:
root@ubuntu:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 32145 MB
node 0 free: 30247 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32248 MB
node 1 free: 30001 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
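For anyone who wants to compare this topology against a cpuset programmatically, here is a minimal sketch that turns the `numactl --hardware` output above into a node-to-CPU map. The function name `parse_numactl_cpus` is hypothetical, and the sketch assumes the `node N cpus: ...` line format shown in this thread.

```python
import re

# Output copied from the numactl --hardware run in this thread (CPU lines only matter here).
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 32145 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 32248 MB
"""


def parse_numactl_cpus(text):
    """Return {numa_node_id: set_of_cpu_ids} parsed from `numactl --hardware` text."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"\s*node (\d+) cpus:\s*(.*)$", line)
        if match:
            nodes[int(match.group(1))] = {int(cpu) for cpu in match.group(2).split()}
    return nodes


topology = parse_numactl_cpus(SAMPLE)
for node_id, cpus in sorted(topology.items()):
    print(f"NUMA node {node_id}: {len(cpus)} CPUs ({min(cpus)}-{max(cpus)})")
```

On this node the map has two entries of 24 CPUs each, which is the baseline for deciding whether a pod is confined to one NUMA node.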
@WangZzzhe I checked the logs on that node and found an error: err: rpc error: code = Unknown desc = hint is empty. I am not sure whether it is related to this problem.
I0606 01:28:02.100303 1 manager.go:417] [ORM] addContainer, pod: numa-dedicated-normal-pod, container: stress
W0606 01:28:02.100337 1 manager.go:488] [ORM] pod: default/numa-dedicated-normal-pod; container: stress allocate resource: cpu without numa nodes affinity
I0606 01:28:02.101041 1 policy.go:683] "[katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).Allocate] called" podNamespace="default" podName="numa-dedicated-normal-pod" containerName="stress" podType="" podRole="" containerType="MAIN" qosLevel="dedicated_cores" numCPUsInt=1 numCPUsFloat64=1 isDebugPod=false
E0606 01:28:02.101136 1 policy_allocation_handlers.go:304] "[katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).dedicatedCoresWithNUMABindingAllocationHandler] unable to allocate CPUs" err="hint is empty" podNamespace="default" podName="numa-dedicated-normal-pod" containerName="stress" numCPUsInt=1 numCPUsFloat64=1
E0606 01:28:02.101727 1 manager.go:501] [ORM] addContainer allocate fail, pod numa-dedicated-normal-pod, container stress, err: rpc error: code = Unknown desc = hint is empty
E0606 01:28:02.101786 1 manager.go:605] [ORM] re addContainer fail, pod numa-dedicated-normal-pod container stress, err: [ORM] addContainer allocate fail, pod numa-dedicated-normal-pod, container stress, err: rpc error: code = Unknown desc = hint is empty
I0606 01:28:02.102672 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/.3396218561\": CREATE"
I0606 01:28:02.102733 1 plugin_watcher.go:174] "Ignoring file (starts with '.')" path=".3396218561"
I0606 01:28:02.105100 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/kubelet_qrm_checkpoint\": CREATE"
I0606 01:28:02.105244 1 plugin_watcher.go:184] "Ignoring non socket file" path="kubelet_qrm_checkpoint"
I0606 01:28:02.216433 1 provisioner.go:84] [malachite] heartbeat
I0606 01:28:02.217117 1 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: katalyst-agent/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://10.6.202.153:10250/stats/summary?timeout=10s'
I0606 01:28:02.218075 1 round_trippers.go:553] GET https://10.6.202.153:10250/stats/summary?timeout=10s 403 Forbidden in 0 milliseconds
I0606 01:28:02.218100 1 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 0 ms Duration 0 ms
I0606 01:28:02.218192 1 round_trippers.go:577] Response Headers:
I0606 01:28:02.218217 1 round_trippers.go:580] Content-Type: text/plain; charset=utf-8
I0606 01:28:02.218231 1 round_trippers.go:580] Content-Length: 114
I0606 01:28:02.218242 1 round_trippers.go:580] Date: Thu, 06 Jun 2024 01:28:02 GMT
I0606 01:28:02.218272 1 request.go:1154] Response Body: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)
E0606 01:28:02.218319 1 provisioner.go:65] failed to update stats/summary from kubelet: "failed to get kubelet config for summary api, error: Forbidden (user=system:serviceaccount:katalyst-system:katalyst-agent, verb=get, resource=nodes, subresource=stats)"
I0606 01:28:02.234127 1 pod.go:206] get metric mem.usage.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 3
I0606 01:28:02.234210 1 pod.go:206] get metric cpu.load.1min.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 2
I0606 01:28:02.234227 1 pod.go:206] get metric cpu.usage.container for pod numa-dedicated-normal-pod, collect time 2024-06-06 01:28:01 +0000 UTC, left len 1
@WangZzzhe
template:
  metadata:
    annotations:
      katalyst.kubewharf.io/qos_level: system_cores
    creationTimestamp: null
    labels:
      app: katalyst-agent
      app.kubernetes.io/instance: katalyst-colocation
      app.kubernetes.io/name: katalyst-agent
  spec:
    containers:
    - args:
      - --plugin-registration-dir=/var/lib/katalyst/plugin-socks
      - --checkpoint-manager-directory=/var/lib/katalyst/plugin-checkpoint
      - --locking-file=/tmp/katalyst_colocation_katalyst_agent_lock
      - --node-name=$(MY_NODE_NAME)
      - --node-address=$(MY_NODE_ADDRESS)
      - --agents=*
      - --cpu-resource-plugin-advisor=true
      - --enable-cpu-pressure-eviction=true
      - --enable-kubelet-secure-port=true
      - --enable-reclaim=true
      - --enable-report-topology-policy=true
      - --eviction-plugins=*
      - --memory-resource-plugin-advisor=true
      - --orm-devices-provider=kubelet
      - --orm-kubelet-pod-resources-endpoints=/var/lib/kubelet/pod-resources/kubelet.sock
      - --orm-resource-names-map=resource.katalyst.kubewharf.io/reclaimed_millicpu=cpu,resource.katalyst.kubewharf.io/reclaimed_memory=memory
      - --pod-resources-server-endpoint=/var/lib/katalyst/pod-resources/kubelet.sock
      - --qrm-socket-dirs=/var/lib/katalyst/plugin-socks
      - --topology-policy-name=none
      - --v=9
      command:
      - katalyst-agent

System version information:
root@ubuntu:~# uname -a
Linux ubuntu 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@flpanbin |
@WangZzzhe It has been set:
root@ubuntu:~/katalyst# ps -ef | grep katalyst-agent
root 2423499 2423425 15 Jun05 ? 01:31:28 katalyst-metric --leader-elect-resource-name=katalyst-colocation-katalyst-metric --leader-elect-resource-namespace=katalyst-system --collector-pod-selector=app=katalyst-agent
root 2604106 2604047 3 02:59 ? 00:00:05 katalyst-agent --plugin-registration-dir=/var/lib/katalyst/plugin-socks --checkpoint-manager-directory=/var/lib/katalyst/plugin-checkpoint --locking-file=/tmp/katalyst_colocation_katalyst_agent_lock --node-name=node1 --node-address=10.6.202.152 --agents=* --cpu-resource-plugin-advisor=true --enable-cpu-pressure-eviction=true --enable-kubelet-secure-port=true --enable-reclaim=true --enable-report-topology-policy=true --eviction-plugins=* --memory-resource-plugin-advisor=true --orm-devices-provider=kubelet --orm-kubelet-pod-resources-endpoints=/var/lib/kubelet/pod-resources/kubelet.sock --orm-resource-names-map=resource.katalyst.kubewharf.io/reclaimed_millicpu=cpu,resource.katalyst.kubewharf.io/reclaimed_memory=memory --pod-resources-server-endpoint=/var/lib/katalyst/pod-resources/kubelet.sock --qrm-socket-dirs=/var/lib/katalyst/plugin-socks --topology-policy-name=best-effort --v=9
@WangZzzhe Do you have any ideas on how to troubleshoot this? What other information should I provide to help locate the problem?
root@ubuntu:~/katalyst/examples# kubectl get nodes
NAME           STATUS   ROLES           AGE   VERSION
10.6.202.151   Ready    control-plane   13d   v1.24.6-kubewharf.8
node1          Ready    <none>          13d   v1.24.6-kubewharf.8
node2          Ready    <none>          13d   v1.24.6-kubewharf.8
root@ubuntu:~/katalyst/examples# containerd -v
containerd github.com/containerd/containerd v1.4.12 7b11cfaabd73bb80907dd23182b9347b4245eb5d
@flpanbin |
But why does the container's cpuset still show 0-47 when I check it? In principle, 24 cores should have been allocated.
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/cpuset.cpus
0-47
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/80c10c4e45cbe275bafac9cf8b74ef72e6c0b9565d33389fce86f6e7c3737843/cpuset.cpus
0-47
root@ubuntu:~/katalyst# cat /sys/fs/cgroup/cpuset/kubepods/podb077c70f-6103-43f9-ba77-64d67ec736ba/604fc985574b2ff117630b699dd4e9ac77c95d316b6ece904385694db0090268/cpuset.cpus
0-47
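To make the "0-47 versus 24 cores" comparison concrete, here is a minimal sketch that expands a cpuset list string and reports which NUMA nodes it touches. The helper names `parse_cpuset` and `numa_nodes_spanned` are hypothetical, and `TOPOLOGY` is hard-coded from the `numactl --hardware` output earlier in this thread rather than read from the machine.

```python
def parse_cpuset(spec):
    """Expand a cpuset list string such as "0-47" or "4-23,28-47" into a set of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes_spanned(cpus, topology):
    """Return the sorted NUMA node ids that own at least one CPU in `cpus`."""
    return sorted(node for node, node_cpus in topology.items() if cpus & node_cpus)


# Assumption: topology of this particular node, taken from the numactl output in the thread.
TOPOLOGY = {0: set(range(0, 24)), 1: set(range(24, 48))}

observed = parse_cpuset("0-47")  # value read from cpuset.cpus above
print(len(observed), numa_nodes_spanned(observed, TOPOLOGY))  # 48 [0, 1]
```

A cpuset confined to one NUMA node on this machine would report a single node id and at most 24 CPUs, so `48 [0, 1]` confirms the container is not restricted to one node.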
I do not see any error logs for syncContainer, but there are log lines showing that the allocation result is 0-47:
I0606 06:28:10.191773 1 manager.go:536] [ORM] reconcile...
I0606 06:28:10.193289 1 policy.go:406] [katalyst-core/pkg/agent/qrm-plugins/cpu/dynamicpolicy.(*DynamicPolicy).GetResourcesAllocation] called
I0606 06:28:10.194287 1 manager.go:550] [ORM] skip getResourceAllocation of resource: resource.katalyst.kubewharf.io/net_bandwidth, because plugin needn't reconciling
I0606 06:28:10.194349 1 manager.go:519] [ORM] syncContainer, pod: katalyst-colocation-katalyst-agent-h6gj8, container: katalyst-agent
I0606 06:28:10.194375 1 manager.go:522] got pod katalyst-colocation-katalyst-agent-h6gj8 container katalyst-agent resources nil
I0606 06:28:10.194450 1 manager.go:519] [ORM] syncContainer, pod: katalyst-colocation-katalyst-metric-85c47ff4bf-7lw4g, container: katalyst-metric
I0606 06:28:10.194472 1 manager.go:522] got pod katalyst-colocation-katalyst-metric-85c47ff4bf-7lw4g container katalyst-metric resources nil
I0606 06:28:10.194550 1 manager.go:630] [ORM] allocation information for resources memory - accompanying resource: memory for pod: default/dedicated-normal-pod2, container: stress is {CpusetMems false true 6.7522060288e+10 0-1 map[] map[] nil {} 0}
I0606 06:28:10.194723 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/dedicated-normal-pod2, container: stress is {CpusetCpus false true 48 0-47 map[] map[] nil {} 0}
I0606 06:28:10.194822 1 manager.go:519] [ORM] syncContainer, pod: dedicated-normal-pod2, container: stress
I0606 06:28:10.195592 1 manager.go:519] [ORM] syncContainer, pod: malachite-fvp5p, container: malachite
I0606 06:28:10.195679 1 manager.go:522] got pod malachite-fvp5p container malachite resources nil
I0606 06:28:10.195707 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_millicpu to cpu
I0606 06:28:10.195784 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_memory to memory
I0606 06:28:10.195804 1 manager.go:630] [ORM] allocation information for resources memory - accompanying resource: memory for pod: default/reclaimed-large-pod-node1, container: stress is {CpusetMems false true 0 0-1 map[] map[] nil {} 0}
I0606 06:28:10.195919 1 manager.go:654] [ORM] map resource name: resource.katalyst.kubewharf.io/reclaimed_millicpu to cpu
I0606 06:28:10.195988 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/reclaimed-large-pod-node1, container: stress is {CpusetCpus false true 40 4-23,28-47 map[] map[] nil {} 0}
I0606 06:28:10.196018 1 manager.go:519] [ORM] syncContainer, pod: reclaimed-large-pod-node1, container: stress
I0606 06:28:10.197415 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/.2613201109\": CREATE"
I0606 06:28:10.197480 1 plugin_watcher.go:174] "Ignoring file (starts with '.')" path=".2613201109"
I0606 06:28:10.199087 1 plugin_watcher.go:160] "Handling create event" event="\"/var/lib/katalyst/plugin-socks/kubelet_qrm_checkpoint\": CREATE"
I0606 06:28:10.199115 1 plugin_watcher.go:184] "Ignoring non socket file" path="kubelet_qrm_checkpoint"
I0606 06:28:10.287012 1 manager.go:312] genericSync
I0606 06:28:10.288051 1 manager.go:387] "GetReportContent" costs="928.548µs" pluginName="headroom-reporter-plugin"
I0606 06:28:10.288224 1 manager.go:387] "GetReportContent" costs="121.926µs" pluginName="system-reporter-plugin"
I0606 06:28:10.288510 1 round_trippers.go:466] curl -v -XGET -H "Accept: application/json, */*" -H "User-Agent: katalyst-agent/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <masked>" 'https://10.6.202.152:10250/pods?timeout=10s'
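When scanning a long agent log, the "[ORM] allocation information" lines are the ones that show what each container was given. Below is a minimal sketch that extracts them with a regular expression. It assumes the log line layout shown above; the meaning of the positional fields inside the braces (fourth field as allocated quantity, fifth as allocation result) is inferred from the sample lines, not from the katalyst source, and `parse_allocations` is a hypothetical helper.

```python
import re

# Assumption: positional layout inferred from the log lines quoted in this thread.
ALLOC_RE = re.compile(
    r"allocation information for resources (?P<resource>\S+) - accompanying resource: \S+ "
    r"for pod: (?P<pod>[^,\s]+), container: (?P<container>\S+) is "
    r"\{(?P<kind>\w+) \S+ \S+ (?P<quantity>\S+) (?P<result>\S+) "
)


def parse_allocations(log_text):
    """Return one dict per '[ORM] allocation information' line found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = ALLOC_RE.search(line)
        if match:
            entry = match.groupdict()
            entry["quantity"] = float(entry["quantity"])
            found.append(entry)
    return found


LOG = """\
I0606 06:28:10.194723 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/dedicated-normal-pod2, container: stress is {CpusetCpus false true 48 0-47 map[] map[] nil {} 0}
I0606 06:28:10.194822 1 manager.go:519] [ORM] syncContainer, pod: dedicated-normal-pod2, container: stress
I0606 06:28:10.195988 1 manager.go:630] [ORM] allocation information for resources cpu - accompanying resource: cpu for pod: default/reclaimed-large-pod-node1, container: stress is {CpusetCpus false true 40 4-23,28-47 map[] map[] nil {} 0}
"""

for alloc in parse_allocations(LOG):
    print(alloc["pod"], alloc["kind"], alloc["quantity"], alloc["result"])
```

For the lines quoted here this surfaces the dedicated pod with a quantity of 48 and a result of 0-47, which matches what the cgroup files show.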
{
"policyName": "dynamic",
"machineState": {
"0": {
"default_cpuset": "",
"allocated_cpuset": "0-23",
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "0-23",
"original_allocation_result": "0-23",
"topology_aware_assignments": {
"0": "0-23"
},
"original_topology_aware_assignments": {
"0": "0-23"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores",
"numa_binding": "true",
--
"original_topology_aware_assignments": {
"0": "0-19"
},
"init_timestamp": "2024-06-06 05:58:40.67405812 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"qosLevel": "reclaimed_cores",
"request_quantity": 42000
}
}
}
},
"1": {
"default_cpuset": "",
"allocated_cpuset": "24-47",
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "24-47",
"original_allocation_result": "24-47",
"topology_aware_assignments": {
"1": "24-47"
},
"original_topology_aware_assignments": {
"1": "24-47"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores",
"numa_binding": "true",
--
"1": "24-43"
},
"original_topology_aware_assignments": {
"1": "24-43"
},
"init_timestamp": "2024-06-06 05:58:40.67405812 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"annotations": {
"katalyst.kubewharf.io/qos_level": "reclaimed_cores"
},
"qosLevel": "reclaimed_cores",
"request_quantity": 42000
}
}
}
}
},
"pod_entries": {
"754dc697-7658-47cc-8faa-0280c2931925": {
"stress": {
"pod_uid": "754dc697-7658-47cc-8faa-0280c2931925",
"pod_namespace": "default",
"pod_name": "dedicated-normal-pod2",
"container_name": "stress",
"container_type": "MAIN",
"owner_pool_name": "dedicated",
"allocation_result": "0-47",
"original_allocation_result": "0-47",
"topology_aware_assignments": {
"0": "0-23",
"1": "24-47"
},
"original_topology_aware_assignments": {
"0": "0-23",
"1": "24-47"
},
"init_timestamp": "2024-06-06 05:09:27.902883009 +0000 UTC",
"labels": {
"katalyst.kubewharf.io/qos_level": "dedicated_cores"
},
"annotations": {
...
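The state dump above is truncated, but the top-level "pod_entries" section already shows the dedicated pod with assignments on both NUMA nodes. Here is a minimal sketch for spotting such entries in a complete dump. It assumes only the field names visible in the excerpt ("pod_entries", "pod_name", "topology_aware_assignments"); `STATE_JSON` is a hand-trimmed stand-in for the real file, and `pods_spanning_numa` is a hypothetical helper.

```python
import json

# Assumption: a trimmed stand-in built from fields visible in the excerpt above.
STATE_JSON = """
{
  "policyName": "dynamic",
  "pod_entries": {
    "754dc697-7658-47cc-8faa-0280c2931925": {
      "stress": {
        "pod_name": "dedicated-normal-pod2",
        "container_name": "stress",
        "owner_pool_name": "dedicated",
        "allocation_result": "0-47",
        "topology_aware_assignments": {"0": "0-23", "1": "24-47"}
      }
    }
  }
}
"""


def pods_spanning_numa(state):
    """Map (pod_name, container_name) to its NUMA ids when it is assigned CPUs on more than one node."""
    spanning = {}
    for containers in state.get("pod_entries", {}).values():
        for container_name, entry in containers.items():
            numa_ids = sorted(entry.get("topology_aware_assignments", {}))
            if len(numa_ids) > 1:
                spanning[(entry["pod_name"], container_name)] = numa_ids
    return spanning


state = json.loads(STATE_JSON)
for (pod, container), numa_ids in pods_spanning_numa(state).items():
    print(f"{pod}/{container} spans NUMA nodes {', '.join(numa_ids)}")
```

Run against the full state file, any pod reported here holds CPUs on more than one NUMA node, which is the condition this issue is about.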
What happened?
I created a dedicated_cores pod, but it does not have an exclusive CPU.
dedicated_cores_pod.yaml:
Check the cpuset for the pod:
What did you expect to happen?
The dedicated_cores pod should be allocated an exclusive CPU core.
How can we reproduce it (as minimally and precisely as possible)?
Create a dedicated_cores pod using the dedicated_cores_pod.yaml mentioned above.
Software version