From 8003f113819ed07fe4997b0d33e72624cccdafaa Mon Sep 17 00:00:00 2001
From: Mark Burnett
Date: Mon, 27 Nov 2017 09:49:01 -0600
Subject: [PATCH] Doc: describe pod checkpointer approach

This documents why the Anchor pattern was created rather than sticking
to the bootkube project's pod-checkpointer.

Change-Id: Ia396dcb5aa4f3b2275eb44e7db6b1f77af18154c
---
 docs/source/design.rst | 48 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/docs/source/design.rst b/docs/source/design.rst
index 68845c92..3438186a 100644
--- a/docs/source/design.rst
+++ b/docs/source/design.rst
@@ -184,6 +184,54 @@
 on the nodes' file systems, but, e.g. in the case of Etcd_, can actually be
 used to manage more complex cleanup like removal from cluster membership.
 
 
+Pod Checkpointer
+~~~~~~~~~~~~~~~~
+
+Before moving to the Anchor pattern described above, this project implemented
+the pod-checkpointer approach pioneered by the Bootkube_ project. While this
+is an appealing approach, it unfortunately suffers from race conditions during
+a full cluster reboot.
+
+During cluster reboot, the checkpointer copies the static manifests of
+essential components into place for the ``kubelet`` to run, which allows those
+components to start and become available. Once the ``apiserver`` and ``etcd``
+cluster are functional, the ``kubelet`` is able to register the failure of its
+workloads and delete those pods via the API. This is where the race begins.
+
+Once those pods are deleted from the ``apiserver``, the pod checkpointer
+notices that the flagged pods are no longer scheduled to run on its node and
+deletes the static manifests for those pods. Concurrently, the
+``controller-manager`` and ``scheduler`` notice that new pods need to be
+created and scheduled (sequentially) and begin that work.
+
+If the new pods are created, scheduled, and started on the node before pod
+checkpointers on other nodes delete their critical services, then the cluster
+may remain healthy after the reboot. If enough of the nodes running the
+critical services fail to start the newly created pods before too many are
+removed, then the cluster does not recover from the hard reboot.
+
+This race is exacerbated by several factors:
+
+1. The sequence of events required to successfully replace these pods is long
+   (the ``controller-manager`` must create pods, then the ``scheduler`` can
+   schedule those pods, then the ``kubelet`` can start them).
+2. The ``controller-manager`` and ``scheduler`` may need to perform leader
+   election during the race, because the leader might have been killed early.
+3. The failure to recover any one of the core sets of processes can cause the
+   entire cluster to fail. This is somewhat trajectory-dependent, e.g. if at
+   least one ``controller-manager`` is scheduled before the
+   ``controller-manager`` processes are all killed, then, assuming the other
+   processes are correctly restarted, the ``controller-manager`` will also
+   recover.
+4. ``etcd`` is somewhat more sensitive to this race, because it requires two
+   successfully restarted pods (assuming a 3-node cluster) rather than the
+   single pod required by the other components.
+
+This race condition was the motivation for the construction and use of the
+Anchor pattern. In future versions of Kubernetes_, it may be possible to use
+`built-in checkpointing `_ from the ``kubelet``.
+
+
 Alternatives
 ------------