Commit 673c872

Merge pull request #360 from cncf/feature/#322_node_failure
Add node failure resilience test cncf/cnf-conformance#322
2 parents 1ed69f5 + 4076747, commit 673c872

18 files changed: +426 −105 lines

TEST-CATEGORIES.md

Lines changed: 1 addition & 1 deletion

@@ -73,7 +73,7 @@ The CNF Conformance program enables interoperability of CNFs from multiple vendo

 ## Resilience Tests
 [Cloud Native Definition](https://github.com/cncf/toc/blob/master/DEFINITION.md) requires systems to be Resilient to failures inevitable in cloud environments. CNF Resilience should be tested to ensure CNFs are designed to deal with non-carrier-grade shared cloud HW/SW platform:
-* For full failures in SW and HW platform: stopped cloud infrastructure/platform services, workload microservices or HW ingredients
+* For full failures in SW and HW platform: stopped cloud infrastructure/platform services, workload microservices or HW ingredients and nodes
 * For bursty, regular or partial impairments on key dependencies: CPU cycles by pausing, limiting or overloading; DPDK-based Dataplane networking by dropping and/or delaying packets.
 * Test if the CNF crashes when network loss occurs (Network Chaos)

USAGE.md

Lines changed: 17 additions & 2 deletions

@@ -29,6 +29,11 @@ crystal build src/cnf-conformance.cr
 crystal src/cnf-conformance.cr all cnf-config=<path_to_your_config_file>/cnf-conformance.yml
 ```
+
+## Running all of the CNF Conformance tests (including proofs of concept)
+```
+crystal src/cnf-conformance.cr all poc cnf-config=<path_to_your_config_file>/cnf-conformance.yml
+```
+
 ## Logging

 ```
@@ -328,13 +333,23 @@ crystal src/cnf-conformance.cr chaos_container_kill
 ```

 ## Platform Tests
-#### (PoC) Run all platform tests
+#### Run all platform tests
 ```
 crystal src/cnf-conformance.cr platform
 ```
-#### (PoC) Run the K8s conformance tests
+#### Run the K8s conformance tests
 ```
 crystal src/cnf-conformance.cr k8s_conformance
 ```
+#### (PoC) Run all platform resilience tests
+```
+crystal src/cnf-conformance.cr resilience poc
+```
+#### (PoC) Run the node failure test. **Warning:** this is a destructive test and will reboot your *host* node!
+#### Don't run this unless you have a completely separate cluster (e.g. you are not running KIND on a dev box)
+```
+crystal src/cnf-conformance.cr node_failure poc destructive
+```

config.yml

Lines changed: 7 additions & 0 deletions

@@ -8,5 +8,12 @@ toggles:
   toggle_on: false
 - name: beta
   toggle_on: false
+- name: poc
+  toggle_on: false
+# Don't change this to true unless you know what you are doing,
+# i.e. your cluster and host (your dev box if you are using
+# kind) will be changed, rebooted, chaos tested, etc.
+- name: destructive
+  toggle_on: false
 loglevel: info
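The new `poc` and `destructive` toggles gate which tests may run at all. The suite does this in Crystal via helpers like `check_poc` and `check_destructive`; the sketch below is a hypothetical Python illustration of the idea — a gated test runs only when every flag it requires is enabled either in `config.yml` or on the command line (the function name and exact precedence here are assumptions, not the suite's code):

```python
def allowed_to_run(required_flags, config_toggles, cli_args):
    """A test gated on flags like {"poc", "destructive"} runs only when
    every required flag is switched on in config.yml toggles or passed
    as a command-line argument (e.g. `node_failure poc destructive`)."""
    enabled = {name for name, on in config_toggles.items() if on}
    enabled |= set(cli_args)
    return required_flags <= enabled

toggles = {"poc": False, "destructive": False}  # the shipped defaults
print(allowed_to_run({"poc", "destructive"}, toggles, ["poc"]))                 # False
print(allowed_to_run({"poc", "destructive"}, toggles, ["poc", "destructive"]))  # True
```

With both toggles defaulting to `false`, the destructive node-failure path stays off unless the operator opts in explicitly.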

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+tolerations:
+- key: "node.kubernetes.io/unreachable"
+  operator: "Exists"
+  effect: "NoExecute"
+  tolerationSeconds: 1
+- key: "node.kubernetes.io/not-ready"
+  operator: "Exists"
+  effect: "NoExecute"
+  tolerationSeconds: 1
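This values file (its path isn't captured in the diff view above) gives the test pods `tolerationSeconds: 1` for the `not-ready` and `unreachable` `NoExecute` taints, so they are evicted about one second after the node drops offline rather than after the default five minutes. Roughly, Kubernetes' `NoExecute` handling can be modeled as below — an illustrative Python sketch of the semantics, not the actual controller code:

```python
def evicted(pod_tolerations, taint_key, elapsed_s):
    """Decide whether a NoExecute taint with taint_key evicts a pod
    after elapsed_s seconds. No matching toleration: evicted at once.
    Matching toleration without tolerationSeconds: tolerated forever.
    With tolerationSeconds: evicted once that many seconds elapse."""
    for t in pod_tolerations:
        if t.get("key") == taint_key and t.get("effect") == "NoExecute":
            seconds = t.get("tolerationSeconds")
            return seconds is not None and elapsed_s >= seconds
    return True

# The tolerations from the values file above:
tolerations = [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 1},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 1},
]
print(evicted(tolerations, "node.kubernetes.io/unreachable", 2))  # True
print(evicted(tolerations, "node.kubernetes.io/unreachable", 0))  # False
```

The short toleration keeps the node-failure test fast: the pod gets rescheduled almost immediately once the node goes NotReady.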

points-all.yml

Lines changed: 4 additions & 0 deletions

@@ -98,3 +98,7 @@

 - name: k8s_conformance
   tags: platform, dynamic
+- name: node_failure
+  tags: platform, dynamic
+- name: recover_from_node_failure
+  tags: platform, dynamic

points.yml

Lines changed: 6 additions & 2 deletions

@@ -107,5 +107,9 @@
 #- name: performance
 #  tags: hardware, dynamic

-#- name: k8s_conformance
-#  tags: platform, dynamic
+# - name: k8s_conformance
+#   tags: platform, dynamic
+# - name: node_failure
+#   tags: platform, dynamic
+# - name: recover_from_node_failure
+#   tags: platform, dynamic

spec/platform/resilience_spec.cr

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+require "./../spec_helper"
+require "colorize"
+require "./../../src/tasks/utils/utils.cr"
+
+describe "Platform" do
+  before_all do
+    # LOGGING.debug `pwd`
+    # LOGGING.debug `echo $KUBECONFIG`
+    `./cnf-conformance samples_cleanup`
+    $?.success?.should be_true
+    `./cnf-conformance setup`
+    $?.success?.should be_true
+    `./cnf-conformance sample_coredns_with_wait_setup`
+    $?.success?.should be_true
+  end
+
+  it "'node_failure' should pass if chaos_mesh node_failure tests prove the platform is resilient" do
+    if check_destructive
+      puts "Tests running in destructive mode".colorize(:red)
+      response_s = `./cnf-conformance platform:node_failure poc destructive`
+      LOGGING.info response_s
+      (/(PASSED: Node came back online)/ =~ response_s).should_not be_nil
+    else
+      response_s = `./cnf-conformance platform:node_failure poc`
+      LOGGING.info response_s
+      (/(PASSED: Nodes are resilient|Skipped)/ =~ response_s).should_not be_nil
+    end
+  end
+end

spec/resilience_spec.cr

Lines changed: 5 additions & 2 deletions

@@ -7,8 +7,6 @@ require "sam"

 describe "Resilience" do
   before_all do
-    # `./cnf-conformance samples_cleanup force=true`
-    # $?.success?.should be_true
     `./cnf-conformance configuration_file_setup`
     $?.success?.should be_true
   end
@@ -18,6 +16,7 @@ describe "Resilience" do
     `./cnf-conformance cnf_setup cnf-config=sample-cnfs/sample-coredns-cnf/cnf-conformance.yml`
     $?.success?.should be_true
     response_s = `./cnf-conformance chaos_container_kill verbose`
+    LOGGING.info response_s
     $?.success?.should be_true
     (/PASSED: Replicas available match desired count after container kill test/ =~ response_s).should_not be_nil
   ensure
@@ -31,6 +30,7 @@ describe "Resilience" do
     `./cnf-conformance cnf_setup cnf-path=sample-cnfs/sample-fragile-state deploy_with_chart=false`
     $?.success?.should be_true
     response_s = `./cnf-conformance chaos_container_kill verbose`
+    LOGGING.info response_s
     $?.success?.should be_true
     (/FAILURE: Replicas did not return desired count after container kill test/ =~ response_s).should_not be_nil
   ensure
@@ -44,6 +44,7 @@ describe "Resilience" do
     `./cnf-conformance cnf_setup cnf-config=sample-cnfs/sample-coredns-cnf/cnf-conformance.yml`
     $?.success?.should be_true
     response_s = `./cnf-conformance chaos_network_loss verbose`
+    LOGGING.info response_s
     $?.success?.should be_true
     (/PASSED: Replicas available match desired count after network chaos test/ =~ response_s).should_not be_nil
   ensure
@@ -57,6 +58,7 @@ describe "Resilience" do
     `./cnf-conformance cnf_setup cnf-path=sample-cnfs/sample_network_loss deploy_with_chart=false`
     $?.success?.should be_true
     response_s = `./cnf-conformance chaos_network_loss verbose`
+    LOGGING.info response_s
     $?.success?.should be_true
     (/FAILURE: Replicas did not return desired count after network chaos test/ =~ response_s).should_not be_nil
   ensure
@@ -70,6 +72,7 @@ describe "Resilience" do
     `./cnf-conformance cnf_setup cnf-config=sample-cnfs/sample-coredns-cnf/cnf-conformance.yml`
     $?.success?.should be_true
     response_s = `./cnf-conformance chaos_cpu_hog verbose`
+    LOGGING.info response_s
     $?.success?.should be_true
     (/PASSED: Application pod is healthy after high CPU consumption/ =~ response_s).should_not be_nil
   ensure

src/tasks/chaos_mesh_setup.cr

Lines changed: 90 additions & 0 deletions

@@ -0,0 +1,90 @@
+require "sam"
+require "file_utils"
+require "colorize"
+require "totem"
+require "./utils/utils.cr"
+
+CHAOS_MESH_VERSION = "v0.8.0"
+
+desc "Install Chaos Mesh"
+task "install_chaosmesh" do |_, args|
+  VERBOSE_LOGGING.info "install_chaosmesh" if check_verbose(args)
+  current_dir = FileUtils.pwd
+  helm = "#{current_dir}/#{TOOLS_DIR}/helm/linux-amd64/helm"
+  crd_install = `kubectl create -f https://raw.githubusercontent.com/chaos-mesh/chaos-mesh/#{CHAOS_MESH_VERSION}/manifests/crd.yaml`
+  VERBOSE_LOGGING.info "#{crd_install}" if check_verbose(args)
+  unless Dir.exists?("#{current_dir}/#{TOOLS_DIR}/chaos_mesh")
+    # TODO use a tagged version
+    fetch_chaos_mesh = `git clone https://github.com/chaos-mesh/chaos-mesh.git #{current_dir}/#{TOOLS_DIR}/chaos_mesh`
+    checkout_tag = `cd #{current_dir}/#{TOOLS_DIR}/chaos_mesh && git checkout tags/#{CHAOS_MESH_VERSION} && cd -`
+  end
+  install_chaos_mesh = `#{helm} install chaos-mesh #{current_dir}/#{TOOLS_DIR}/chaos_mesh/helm/chaos-mesh --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock`
+  wait_for_resource("#{current_dir}/spec/fixtures/chaos_network_loss.yml")
+  wait_for_resource("#{current_dir}/spec/fixtures/chaos_cpu_hog.yml")
+  wait_for_resource("#{current_dir}/spec/fixtures/chaos_container_kill.yml")
+end
+
+desc "Uninstall Chaos Mesh"
+task "uninstall_chaosmesh" do |_, args|
+  VERBOSE_LOGGING.info "uninstall_chaosmesh" if check_verbose(args)
+  current_dir = FileUtils.pwd
+  helm = "#{current_dir}/#{TOOLS_DIR}/helm/linux-amd64/helm"
+  crd_delete = `kubectl delete -f https://raw.githubusercontent.com/chaos-mesh/chaos-mesh/#{CHAOS_MESH_VERSION}/manifests/crd.yaml`
+  FileUtils.rm_rf("#{current_dir}/#{TOOLS_DIR}/chaos_mesh")
+  delete_chaos_mesh = `#{helm} delete chaos-mesh`
+end
+
+def wait_for_test(test_type, test_name)
+  second_count = 0
+  wait_count = 60
+  status = ""
+  until (!status.empty? && status == "Finished") || second_count > wait_count.to_i
+    LOGGING.debug "second_count = #{second_count}"
+    sleep 1
+    get_status = `kubectl get "#{test_type}" "#{test_name}" -o yaml`
+    LOGGING.info("#{get_status}")
+    status_data = Totem.from_yaml("#{get_status}")
+    LOGGING.info "Status: #{get_status}"
+    LOGGING.debug("#{status_data}")
+    status = status_data.get("status").as_h["experiment"].as_h["phase"].as_s
+    second_count = second_count + 1
+    LOGGING.info "#{get_status}"
+    LOGGING.info "#{second_count}"
+  end
+  # Did Chaos Mesh finish the test successfully?
+  (!status.empty? && status == "Finished")
+end
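`wait_for_test` polls `kubectl get` once a second until the experiment's `status.experiment.phase` reads `Finished`, giving up after 60 attempts. The same poll-until-phase control flow can be sketched in Python, with the `kubectl` call abstracted into a callable so it can be exercised without a cluster (an illustration of the pattern, not the suite's code):

```python
def wait_for_phase(get_phase, target="Finished", max_attempts=60):
    """Poll get_phase() (a stand-in for `kubectl get <type> <name> -o yaml`
    plus YAML parsing) until it returns target or max_attempts elapse.
    Returns True only if the target phase was observed in time."""
    for _ in range(max_attempts):
        if get_phase() == target:
            return True
        # a real implementation would sleep ~1s between polls here
    return False

# Simulated experiment that reports "Running" twice, then "Finished".
phases = iter(["Running", "Running", "Finished"])
print(wait_for_phase(lambda: next(phases, "Finished")))  # True
```

Injecting the status source like this also makes the timeout path easy to test: a callable that never returns `"Finished"` makes the helper return `False` after `max_attempts` polls.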
+
+def desired_is_available?(deployment_name)
+  resp = `kubectl get deployments #{deployment_name} -o=yaml`
+  describe = Totem.from_yaml(resp)
+  LOGGING.info("desired_is_available describe: #{describe.inspect}")
+  desired_replicas = describe.get("status").as_h["replicas"].as_i
+  LOGGING.info("desired_is_available desired_replicas: #{desired_replicas}")
+  ready_replicas = describe.get("status").as_h["readyReplicas"]?
+  if ready_replicas.nil?
+    ready_replicas = 0
+  else
+    ready_replicas = ready_replicas.as_i
+  end
+  LOGGING.info("desired_is_available ready_replicas: #{ready_replicas}")
+
+  desired_replicas == ready_replicas
+end
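`desired_is_available?` compares the deployment's `status.replicas` against `status.readyReplicas`, treating a missing `readyReplicas` field (no pods Ready yet) as zero. The same check over an already-parsed status block, sketched in Python for illustration:

```python
def desired_is_available(status):
    """status mirrors the `status:` block of
    `kubectl get deployment <name> -o yaml` once parsed into a dict."""
    desired = status["replicas"]
    ready = status.get("readyReplicas", 0)  # absent until a pod is Ready
    return desired == ready

print(desired_is_available({"replicas": 2, "readyReplicas": 2}))  # True
print(desired_is_available({"replicas": 2}))                      # False
```

Handling the absent `readyReplicas` key is the important detail: right after a chaos event the field can disappear entirely, and treating that as "available" would make the check pass spuriously.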
+
+def wait_for_resource(resource_file)
+  second_count = 0
+  wait_count = 60
+  is_resource_created = nil
+  until (!is_resource_created.nil? && is_resource_created == true) || second_count > wait_count.to_i
+    LOGGING.info "second_count = #{second_count}"
+    sleep 3
+    `kubectl create -f #{resource_file} 2>&1 >/dev/null`
+    is_resource_created = $?.success?
+    LOGGING.info "Waiting for CRD"
+    LOGGING.info "Status: #{is_resource_created}"
+    LOGGING.debug "resource file: #{resource_file}"
+    second_count = second_count + 1
+  end
+  `kubectl delete -f #{resource_file}`
+end
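`wait_for_resource` retries `kubectl create` until the freshly installed Chaos Mesh CRDs accept the probe resource, then deletes it again. (One aside: the redirection `2>&1 >/dev/null` sends stderr to the terminal and only stdout to `/dev/null`; silencing both would require the reversed order `>/dev/null 2>&1`.) The retry-until-accepted loop, with the create attempt injected so the logic runs standalone (a Python sketch of the pattern):

```python
def wait_for_resource(try_create, max_attempts=60):
    """Retry try_create() (a stand-in for `kubectl create -f <file>`
    returning True on success) until the API server accepts the
    resource; returns False if it never does within max_attempts."""
    for _ in range(max_attempts):
        if try_create():
            return True
        # a real implementation would sleep ~3s between attempts here
    return False

attempts = iter([False, False, True])  # CRD becomes ready on the third try
print(wait_for_resource(lambda: next(attempts, True)))  # True
```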

src/tasks/platform/resilience.cr

Lines changed: 109 additions & 0 deletions

@@ -0,0 +1,109 @@
+# coding: utf-8
+require "sam"
+require "colorize"
+require "../utils/utils.cr"
+
+namespace "platform" do
+  desc "The CNF conformance suite checks to see if the CNFs are resilient to failures."
+  task "resilience", ["node_failure"] do |t, args|
+    VERBOSE_LOGGING.info "resilience" if check_verbose(args)
+    VERBOSE_LOGGING.debug "resilience args.raw: #{args.raw}" if check_verbose(args)
+    VERBOSE_LOGGING.debug "resilience args.named: #{args.named}" if check_verbose(args)
+    stdout_score("resilience")
+  end
+
+  desc "Does the Platform recover the node and reschedule pods when a worker node fails"
+  task "node_failure" do |_, args|
+    unless check_poc(args) && check_destructive(args)
+      LOGGING.info "skipping node_failure: not in POC and destructive mode"
+      puts "Skipped".colorize(:yellow)
+      next
+    end
+    LOGGING.info "Running POC in destructive mode!"
+    task_response = task_runner(args) do |args|
+      current_dir = FileUtils.pwd
+      helm = "#{current_dir}/#{TOOLS_DIR}/helm/linux-amd64/helm"
+
+      # Select the first node that isn't a master and is also schedulable
+      worker_nodes = `kubectl get nodes --selector='!node-role.kubernetes.io/master' -o 'go-template={{range .items}}{{$taints:=""}}{{range .spec.taints}}{{if eq .effect "NoSchedule"}}{{$taints = print $taints .key ","}}{{end}}{{end}}{{if not $taints}}{{.metadata.name}}{{ "\\n"}}{{end}}{{end}}'`
+      worker_node = worker_nodes.split("\n")[0]
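The go-template above prints the names of nodes that lack the master role and carry no `NoSchedule` taint, and the task takes the first one. The same selection over node objects shaped like `kubectl get nodes -o json` items can be sketched in Python (illustrative only; field shapes follow the Kubernetes node object):

```python
def first_schedulable_worker(nodes):
    """Return the name of the first node with no master-role label and
    no NoSchedule taint, or None if every node is excluded."""
    for node in nodes:
        if "node-role.kubernetes.io/master" in node["metadata"].get("labels", {}):
            continue  # skip masters (the kubectl --selector does this part)
        taints = node.get("spec", {}).get("taints") or []
        if any(t.get("effect") == "NoSchedule" for t in taints):
            continue  # skip unschedulable nodes (the go-template does this part)
        return node["metadata"]["name"]
    return None

nodes = [
    {"metadata": {"name": "master-1",
                  "labels": {"node-role.kubernetes.io/master": ""}},
     "spec": {"taints": [{"key": "node-role.kubernetes.io/master",
                          "effect": "NoSchedule"}]}},
    {"metadata": {"name": "worker-1", "labels": {}}, "spec": {}},
]
print(first_schedulable_worker(nodes))  # worker-1
```

Picking a schedulable worker matters because the test pins the coredns deployment to that node via `nodeSelector` and then reboots exactly that node.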
30+
31+
32+
File.write("node_failure_values.yml", NODE_FAILURE_VALUES)
33+
install_coredns = `#{helm} install node-failure -f ./node_failure_values.yml --set nodeSelector."kubernetes\\.io/hostname"=#{worker_node} stable/coredns`
34+
wait_for_install("node-failure-coredns")
35+
36+
37+
File.write("reboot_daemon_pod.yml", REBOOT_DAEMON)
38+
install_reboot_daemon = `kubectl create -f reboot_daemon_pod.yml`
39+
wait_for_install("node-failure-coredns")
40+
41+
pod_ready = ""
42+
pod_ready_timeout = 45
43+
begin
44+
until (pod_ready == "true" || pod_ready_timeout == 0)
45+
pod_ready = pod_status("reboot", "--field-selector spec.nodeName=#{worker_node}").split(",")[2]
46+
pod_ready_timeout = pod_ready_timeout - 1
47+
if pod_ready_timeout == 0
48+
upsert_failed_task("recover_from_node_failure", "✖️ FAILURE: Failed to install reboot daemon")
49+
exit 1
50+
end
51+
sleep 1
52+
puts "Waiting for reboot daemon to be ready"
53+
puts "Reboot Daemon Ready Status: #{pod_ready}"
54+
end
55+
56+
# Find Reboot Daemon name
57+
reboot_daemon_pod = pod_status("reboot", "--field-selector spec.nodeName=#{worker_node}").split(",")[0]
58+
start_reboot = `kubectl exec -ti #{reboot_daemon_pod} touch /tmp/reboot`
59+
60+
#Watch for Node Failure.
61+
pod_ready = ""
62+
node_ready = ""
63+
node_failure_timeout = 30
64+
until (pod_ready == "false" || node_ready == "False" || node_ready == "Unknown" || node_failure_timeout == 0)
65+
pod_ready = pod_status("node-failure").split(",")[2]
66+
node_ready = node_status("#{worker_node}")
67+
puts "Waiting for Node to go offline"
68+
puts "Pod Ready Status: #{pod_ready}"
69+
puts "Node Ready Status: #{node_ready}"
70+
node_failure_timeout = node_failure_timeout - 1
71+
if node_failure_timeout == 0
72+
upsert_failed_task("recover_from_node_failure", "✖️ FAILURE: Node failed to go offline")
73+
exit 1
74+
end
75+
sleep 1
76+
end
77+
78+
#Watch for Node to come back online
79+
pod_ready = ""
80+
node_ready = ""
81+
node_online_timeout = 300
82+
until (pod_ready == "true" && node_ready == "True" || node_online_timeout == 0)
83+
pod_ready = pod_status("node-failure", "").split(",")[2]
84+
node_ready = node_status("#{worker_node}")
85+
puts "Waiting for Node to come back online"
86+
puts "Pod Ready Status: #{pod_ready}"
87+
puts "Node Ready Status: #{node_ready}"
88+
node_online_timeout = node_online_timeout - 1
89+
if node_online_timeout == 0
90+
upsert_failed_task("recover_from_node_failure", "✖️ FAILURE: Node failed to come back online")
91+
exit 1
92+
end
93+
sleep 1
94+
end
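The two `until` loops above form a two-phase watch: first wait (up to 30 polls) for the node's Ready condition to leave `True`, then wait (up to 300 polls) for it to return. The structure, with the node poll injected as a callable so it runs without a cluster (a Python sketch mirroring the task's control flow and result strings):

```python
def two_phase_node_watch(node_status, offline_attempts=30, online_attempts=300):
    """node_status() stands in for polling the node's Ready condition
    via `kubectl get node`; it returns "True", "False", or "Unknown"."""
    def wait_for(is_done, attempts):
        for _ in range(attempts):
            if is_done(node_status()):
                return True
        return False

    # Phase 1: the reboot must take the node out of Ready=True.
    if not wait_for(lambda s: s in ("False", "Unknown"), offline_attempts):
        return "FAILURE: Node failed to go offline"
    # Phase 2: the platform must bring it back.
    if not wait_for(lambda s: s == "True", online_attempts):
        return "FAILURE: Node failed to come back online"
    return "PASSED: Node came back online"

# Node goes Unknown for two polls, then recovers.
statuses = iter(["True", "Unknown", "Unknown", "True"])
print(two_phase_node_watch(lambda: next(statuses, "True")))
```

The asymmetric timeouts match the physics of the test: a reboot should register as NotReady within seconds, while a full boot plus kubelet re-registration can legitimately take minutes.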
+
+        emoji_chaos_network_loss = "📶☠️"
+        resp = upsert_passed_task("recover_from_node_failure", "✔️ PASSED: Node came back online #{emoji_chaos_network_loss}")
+
+      ensure
+        LOGGING.info "node_failure cleanup"
+        delete_reboot_daemon = `kubectl delete -f reboot_daemon_pod.yml`
+        delete_coredns = `#{helm} delete node-failure`
+        File.delete("reboot_daemon_pod.yml")
+        File.delete("node_failure_values.yml")
+      end
+    end
+  end
+end
