Skip to content

Latest commit

 

History

History
94 lines (76 loc) · 3.35 KB

night-manual.md

File metadata and controls

94 lines (76 loc) · 3.35 KB

Night manual

Make sure to also check troubleshooting docs.

Before you start, record the current state of the cluster:

kubectl cluster-info dump > cluster-dump.json

Application problems

The steps and subsections here are in order of severity, starting with mild actions that have minimal impact on the workload/cluster.

Try killing problematic pods to get fresh ones:

kubectl delete pods <name> -n <namespace>

Helm charts

Set up helm flags to access the correct tiller:

source scripts/helm-env.sh <namespace> prod-certs/cluster/certs <admin-name>
# Example
# source scripts/helm-env.sh kube-system prod-certs/cluster/certs lennart
# see that it works
helm list

Roll back helm releases:

# show current release(s)
helm list
# check the history of revisions
helm history <release>
# roll back to previous revision
helm rollback <release> 0
# roll back to specific revision
helm rollback <release> <revision>

Backup current values:

helm get values <release> > backup-values.yaml

If the above doesn't work, check at least the image tag(s) used:

kubectl -n <namespace> get deploy -o wide

If that is also impossible, check the latest CI job to see what image tag was used or check the container registry for a suitable image to use.

Delete helm chart:

helm delete <release>
# Alternatively, if the release is in a bad state:
helm delete --purge <release>

Warning: the above will lead to downtime since the application is completely removed from the cluster. Note that the second command will completely clear the revision history.

Reinstall by rerunning a CI job OR manually:

helm upgrade --install <release> <chart> --namespace <namespace> -f backup-values.yaml
# Example:
# helm upgrade --install app1 app-chart --namespace app-namespace -f backup-values.yaml

If you were unable to do a backup of the values, check the documentation for the application on how to set the correct image tag.

Completely recreate a namespace

Note: before you do this, make sure to backup current values as described above.

Check the application specific documentation for each application. In short you will have to delete the namespace (kubectl delete namespace <namespace>) and then re-initialize it for the troublesome application. This will most likely require re-running the post-install.yaml playbook and/or a CI/CD pipeline job and/or manual initialization of resources the application depends on.

To reinstall the chart(s), set up helm flags and reinstall as described above.

Node problem

If a node is having problems, try adding new nodes (using terraform), scaling up the cluster and draining the problematic node.

kubectl drain <node-name> --ignore-daemonsets --delete-local-data

This should be enough to move almost everything from the node. Only daemonsets will still run pods there.

To completely remove the node, you also have to remove it from the cluster and delete the virtual machine. See the kubespray documentation for how to remove the node from the cluster. Remember to also delete the virtual machine using terraform.

# remove the virtual machine from terraform code and then apply:
terraform apply