From a57158d0e940358318cf6f8514109a7c7033bd2b Mon Sep 17 00:00:00 2001
From: Phil Sphicas
Date: Sat, 13 Feb 2021 06:45:59 +0000
Subject: [PATCH] Disable kubernetes-etcd anchor cleanup in gates

Interesting gate failure:

* the kubernetes-etcd chart is installed
* the kubernetes-etcd-anchor pod creates a new kubernetes-etcd manifest
* the kubernetes-etcd pod restarts
* an etcd leader election happens, triggering a tiller failure
* tiller tries to purge/delete the chart
* the kubernetes-etcd-anchor pod can't terminate, because its preStop
  hook gets stuck in a loop trying to reach etcd via the service
  endpoint, and the termination grace period is 3600s

This change disables cleanup for the kubernetes-etcd anchor pod.

An alternative fix would be to shorten the grace period. However, by
this point the haproxy anchor and kube-apiserver anchor pods have
already done their jobs, so kube-apiserver talks to etcd through
haproxy, and haproxy only knows about the kubernetes-etcd pod, not the
auxiliary etcd pods. With a shorter grace period, the kubernetes-etcd
anchor would likely restart and spin up a new kubernetes-etcd pod in
time, but this may occasionally fail.
Change-Id: Ifa71394b2f87e227a6c4ad1b4c80900cec6f5684
---
 examples/containerd/armada-resources.yaml | 1 +
 examples/gate/armada-resources.yaml       | 1 +
 2 files changed, 2 insertions(+)

diff --git a/examples/containerd/armada-resources.yaml b/examples/containerd/armada-resources.yaml
index 59269cdf..d84a6779 100644
--- a/examples/containerd/armada-resources.yaml
+++ b/examples/containerd/armada-resources.yaml
@@ -857,6 +857,7 @@ data:
   values:
     anchor:
       etcdctl_endpoint: kubernetes-etcd.kube-system.svc.cluster.local
+      enable_cleanup: false
     labels:
       anchor:
         node_selector_key: kubernetes-etcd
diff --git a/examples/gate/armada-resources.yaml b/examples/gate/armada-resources.yaml
index ac16e446..858bb3a2 100644
--- a/examples/gate/armada-resources.yaml
+++ b/examples/gate/armada-resources.yaml
@@ -863,6 +863,7 @@ data:
   values:
     anchor:
       etcdctl_endpoint: kubernetes-etcd.kube-system.svc.cluster.local
+      enable_cleanup: false
     labels:
       anchor:
         node_selector_key: kubernetes-etcd