
Design

Promenade is a Kubernetes cluster deployment tool with the following goals:

  • Resiliency in the face of node loss and full cluster reboot.
  • Bare metal node support without external runtime dependencies.
  • A fully functional single-node cluster that allows cluster-hosted tooling to provision the remaining cluster nodes.
  • Helm chart managed component life-cycle.
  • API-managed cluster life-cycle.

Cluster Bootstrapping

The cluster is bootstrapped on a single node, called the genesis node. This node goes through a short-lived bootstrapping phase driven by static pod manifests consumed by kubelet, then quickly moves to chart-managed infrastructure, driven by Armada.

During the bootstrapping phase, the following temporary components are run as static pods which are configured directly from Promenade's configuration documents:

With these components up, it is possible to leverage Armada to deploy Helm charts to manage these components (and additional components) going forward.

Though completely configurable, a typical Armada manifest should specify charts for:

Once these charts are deployed, the cluster is validated (currently, validation is limited to resolving DNS queries and verifying basic Kubernetes functionality, including Pod scheduling and log collection), and then the genesis process is complete. Additional nodes can be added to the cluster using day 2 procedures.
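
The exact checks are implementation details, but a rough sketch of this style of validation, resolving a name, scheduling a throwaway Pod, and reading its logs back, might look like the following. The namespace, Pod name, image, and the name being resolved are illustrative assumptions, not Promenade's actual validation code::

    # Hedged sketch of post-genesis validation: DNS resolution, Pod
    # scheduling, and log collection. All names below are assumptions.
    import socket
    import time

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # 1. DNS check: the name to resolve is an assumption.
    socket.getaddrinfo("kubernetes.default.svc.cluster.local", 443)

    # 2. Pod scheduling check: run a short-lived Pod to completion.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="genesis-validation", namespace="default"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="test", image="busybox:1.36", command=["echo", "ok"])]))
    v1.create_namespaced_pod("default", pod)
    while True:
        phase = v1.read_namespaced_pod("genesis-validation", "default").status.phase
        if phase in ("Succeeded", "Failed"):
            break
        time.sleep(2)

    # 3. Log collection check.
    print(v1.read_namespaced_pod_log("genesis-validation", "default"))
    v1.delete_namespaced_pod("genesis-validation", "default")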

After additional master nodes are added to the cluster, it is possible to remove the genesis node from the cluster so that it can be fully re-provisioned using the same process as for all the other nodes.

Life-cycle Management

There are two sets of resources that require life-cycle management: cluster nodes and Kubernetes control plane components. These two sets of resources are managed differently.

Node Life-Cycle Management

Node life-cycle management tools are provided via an API to be consumed by other tools like Drydock and Shipyard.

The life-cycle operations for nodes are:

  1. Adding a node to the cluster
  2. Removing a node from the cluster
  3. Adding and removing node labels

Adding a node to the cluster

Adding a node to the cluster is done by running a shell script on the node that installs the kubelet and configures it to find and join the cluster. This script can either be generated up front via the CLI, or it can be obtained from the join-scripts endpoint of the API (development of this API is in progress).

Nodes can only be joined when all the required configuration documents are available, including the certificates needed by the kubelet.
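
As an illustration of the API-driven flow, a provisioning tool might fetch and run a join script roughly as sketched below. Because the API is still under development, the endpoint path, query parameters, hostname, and address shown here are assumptions rather than a documented contract::

    # Hedged sketch: fetch a join script from a Promenade API instance and
    # run it on the node being joined. URL, path, and parameters are assumed.
    import subprocess

    import requests

    PROMENADE_API = "http://promenade-api.example.com"  # assumed API location

    params = {
        "hostname": "worker-01",   # assumed name for the joining node
        "ip": "10.0.0.21",         # assumed node IP
    }

    resp = requests.get(f"{PROMENADE_API}/api/v1.0/join-scripts", params=params)
    resp.raise_for_status()

    with open("join.sh", "w") as f:
        f.write(resp.text)

    # The script installs and configures the kubelet, so it needs elevated privileges.
    subprocess.run(["sudo", "bash", "join.sh"], check=True)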

Removing a node from the cluster

This is currently possible by leveraging the promenade-teardown script placed on each host. API support for this function is planned, but not yet implemented.

Adding and removing node labels

This is currently only possible directly via kubectl, though API support for this functionality is planned.

It is through relabeling nodes that key day 2 operations, such as moving a master node, are achieved.
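
For example, relabeling a node with the official Kubernetes Python client might look like the sketch below, which mirrors ``kubectl label node <node> <key>=<value>``; the node name and label key are illustrative assumptions::

    # Hedged sketch of adding and removing a node label; the merge-patch
    # semantics used here treat a null value as "delete this key".
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    node = "master-02"  # assumed node name

    # Add (or update) a label.
    v1.patch_node(node, {"metadata": {"labels": {"example-role": "enabled"}}})

    # Remove the same label.
    v1.patch_node(node, {"metadata": {"labels": {"example-role": None}}})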

Control-Plane Component Life-Cycle Management

With the exception of the Docker daemon and the kubelet, life-cycle management of control plane components is handled via Helm chart updates, which are orchestrated by Armada.

The Docker daemon is managed as an APT package, with configuration installed at the time the node is configured to join the cluster.

The kubelet is directly installed and configured at the time nodes join the cluster. Work is in progress to improve the upgradability of kubelet via either a system package or a chart.

Resiliency

The two primary failure scenarios Promenade is designed to be resilient against are node loss and full cluster restart.

Kubernetes has a well-defined High Availability pattern, which deals well with node loss.

However, this pattern requires an external load balancer for apiserver discovery. Since it is a goal of this project for the cluster to be able to operate without ongoing external dependencies, we must avoid that requirement.

Additionally, in the event of a full cluster restart, we cannot rely on any response from the apiserver to tell any kubelet which processes to run. That means each master node must be self-sufficient, so that once a quorum of Etcd members is achieved, the cluster may resume normal operation.

The solution approach is two-pronged:

  1. Deploy a local discovery mechanism for the apiserver processes on each node so that core components can always find the apiservers when their nodes reboot.
  2. Apply the Anchor pattern described below to ensure that essential components on master nodes restart even when the apiservers are not available.

Currently, the discovery mechanism for the apiserver processes is provided by CoreDNS via a zone file written to disk on each node. This approach has some drawbacks, which might be addressed in future work by leveraging HAProxy for discovery instead.
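
As a sketch of what that can look like, a small helper could render a zone file mapping one well-known name to the full set of apiserver addresses, so resolution keeps working as long as the node's local CoreDNS is up. The zone name, record name, addresses, and file path below are assumptions for illustration, not Promenade's actual configuration::

    # Hedged sketch: render a minimal zone file listing every apiserver so
    # local components can discover (and fail over between) them.
    APISERVER_IPS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # assumed masters
    ZONE = "promenade.local"  # assumed private zone
    NAME = "kubernetes"       # assumed discovery name

    def render_zone(serial: int = 1) -> str:
        lines = [
            f"$ORIGIN {ZONE}.",
            "$TTL 30",
            f"@ IN SOA ns.{ZONE}. admin.{ZONE}. ({serial} 60 60 600 30)",
            f"@ IN NS ns.{ZONE}.",
            "ns IN A 127.0.0.1",
        ]
        # One A record per apiserver.
        lines += [f"{NAME} IN A {ip}" for ip in APISERVER_IPS]
        return "\n".join(lines) + "\n"

    # Each node writes the rendered file to disk for its local CoreDNS to serve.
    with open(f"/etc/coredns/zones/db.{ZONE}", "w") as f:  # assumed path
        f.write(render_zone())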

Anchor Pattern

The anchor pattern provides a way to manage process life-cycle using Helm charts in a way that allows them to be restarted immediately in the event of a node restart -- even when the Kubernetes apiserver is unreachable.

In this pattern, a DaemonSet called the anchor runs on selected nodes and is responsible for managing the life-cycle of assets deployed onto the node's file system. In particular, these assets include a Kubernetes Pod manifest to be consumed by the kubelet, which then manages the processes specified by that Pod. That management continues even when the node reboots, since static pods like this are run by the kubelet even when the apiserver is not available.

Cleanup of these resources is managed by the anchor pods' preStop life-cycle hooks. This usually amounts to removing the files originally placed on the node's file system, but it can also be used for more complex cleanup; in the case of Etcd, for example, the hook removes the departing member from cluster membership.
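
A minimal sketch of what an anchor process does, assuming hypothetical asset paths and ignoring details such as checksumming and re-synchronizing changed assets, might look like this; the SIGTERM handler stands in for the preStop cleanup described above::

    # Hedged sketch of an anchor container's main process: place a static Pod
    # manifest for the kubelet, idle, and remove it again when told to stop.
    import os
    import shutil
    import signal
    import time

    SOURCE = "/anchor-assets/etcd-pod.yaml"              # assumed chart asset path
    TARGET = "/etc/kubernetes/manifests/etcd-pod.yaml"   # assumed kubelet static-pod dir

    def cleanup(signum, frame):
        # Mirrors the preStop hook: removing the asset makes the kubelet stop the pod.
        if os.path.exists(TARGET):
            os.remove(TARGET)
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, cleanup)

    # Once the asset is placed, the kubelet runs the static pod on its own,
    # even across node reboots and while the apiserver is unreachable.
    shutil.copy(SOURCE, TARGET)

    while True:
        time.sleep(60)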

Alternatives