Avoid calico-etcd crashloop

Sometimes the calico-etcd pod crashloops when it is being bootstrapped.
This occurs intermittently in the gates.

Best guess: when the etcd-anchor pod initially creates the etcd static
manifest, it waits for the anchor period (15 seconds) for the etcd pod
to become ready. If it is not ready, the next iteration through the loop
recreates an identical manifest. The fact that it is a new file causes
kubelet to terminate the original container and start up a new one.

Kubelet and the container runtime get out of sync, and kubelet cannot
determine the correct container id, so the pod ends up crashlooping
forever. Manually removing and re-adding the manifest file does not
resolve the condition, although a kubelet restart does.

This "fix" will only write the updated manifest if it is different, and
hopefully will prevent the condition from occurring.

Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
Author: Phil Sphicas 2021-02-10 21:20:19 +00:00
Parent: 77c762463b
Commit: d161528ae8
1 changed file with 1 addition and 1 deletion


@@ -36,7 +36,7 @@ create_manifest () {
     cp -f /anchor-etcd/{{ .Values.service.name }}.yaml $WIP
     sed -i -e 's#_ETCD_INITIAL_CLUSTER_STATE_#'$2'#g' $WIP
     sed -i -e 's#_ETCD_INITIAL_CLUSTER_#'$1'#g' $WIP
-    mv -f "$WIP" "$3"
+    sync_file "$WIP" "$3"
 }
 sync_configuration () {
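
The sync_file helper itself is not shown in this hunk. A minimal sketch of
the intended "write only if different" behavior might look like the
following (the function name matches the diff; the body is an assumption
based on the commit message, not the actual implementation):

```shell
# Sketch of sync_file: move the work-in-progress file ($1) over the
# target manifest ($2) only when the contents differ, so kubelet never
# sees a "new" file with identical contents. Otherwise, discard the WIP.
sync_file () {
    if ! cmp -s "$1" "$2"; then
        mv -f "$1" "$2"   # contents differ: update the static manifest
    else
        rm -f "$1"        # identical: leave the manifest untouched
    fi
}
```

Because an unchanged manifest is never rewritten, kubelet sees no file
event and does not restart the container on each anchor iteration.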