* remove the healthcheck sidecar and perform probes in the etcd
container itself; liveness probes that fail in a sidecar do not
restart the problematic etcd container;
* verify that the etcdctl member list command in the anchor is
always successful;
* point the ETCDCTL_ENDPOINTS env in the etcd container at the
POD_IP variable instead of localhost (127.0.0.1);
* add liveness/readiness probes to the auxiliary etcd as well,
and properly pass etcd configuration variables as strings;
* monitor the current leader in the initial etcd cluster; if an
aux member is the current leader, transfer leadership to a permanent
member; the same check applies to the aux suicide process;
* keep the etcd aux pod alive until all permanent nodes have come
up and joined the cluster and the apiserver no longer relies on
aux members;
* add a 5-second sleep between aux member removals for a smoother
transition process.
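The aux-leader check above can be sketched as follows. This is a minimal
illustration, not the actual anchor code: etcdctl is stubbed out so the
control flow is self-contained, and the member IDs and names (aux-0,
perm-0) are hypothetical.

```shell
#!/bin/sh
# Stub: pretend "aux-0" (id 1111) is the current leader and "perm-0"
# (id 2222) is a permanent member.
etcdctl() {
    case "$1" in
        endpoint)    echo '1111' ;;   # id of the current leader
        member)      printf '1111, started, aux-0\n2222, started, perm-0\n' ;;
        move-leader) echo "leadership transferred to $2" ;;
    esac
}

LEADER_ID=$(etcdctl endpoint status)
LEADER_NAME=$(etcdctl member list | awk -F', ' -v id="$LEADER_ID" '$1 == id { print $3 }')

case "$LEADER_NAME" in
    aux-*)
        # The leader is an auxiliary member: hand leadership to a
        # permanent member before removing the aux member.
        PERM_ID=$(etcdctl member list | awk -F', ' '$3 !~ /^aux-/ { print $1; exit }')
        etcdctl move-leader "$PERM_ID"
        ;;
esac
```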
Signed-off-by: Ruslan Aliev <raliev@mirantis.com>
Change-Id: I7918072a6ba5a6b22b359d1616def8c31425462d
Sometimes the calico-etcd pod crashloops when it is being bootstrapped.
This occurs intermittently in the gates.
Best guess: when the etcd-anchor pod initially creates the etcd static
manifest, it waits for the anchor period (15 seconds) for the etcd pod
to become ready. If it is not ready, the next iteration through the loop
recreates an identical manifest. The fact that it is a new file causes
kubelet to terminate the original container and start up a new one.
Kubelet and the container runtime get out of sync, and kubelet can't
figure out the correct container id, so the pod ends up crashlooping
forever. Manually removing and re-adding the manifest file doesn't
resolve the condition, although a kubelet restart actually does.
This "fix" will only write the updated manifest if it is different, and
hopefully will prevent the condition from occurring.
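The guard can be sketched as: render into a temp file, then only move it
over the file kubelet is watching when the content actually differs. The
paths and function name below are hypothetical stand-ins, not the real
anchor script.

```shell
#!/bin/sh
MANIFEST=$(mktemp -d)/etcd.yaml

write_manifest() {
    # Render into a temp file first and compare before replacing, so an
    # identical manifest never touches the file kubelet is watching.
    tmp=$(mktemp)
    printf '%s\n' "$1" > "$tmp"
    if ! cmp -s "$tmp" "$MANIFEST" 2>/dev/null; then
        mv "$tmp" "$MANIFEST"
        echo "manifest updated"
    else
        rm -f "$tmp"
        echo "manifest unchanged"
    fi
}

write_manifest "apiVersion: v1"   # first write: file is created
write_manifest "apiVersion: v1"   # identical content: file left alone
```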
Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
This adds "set -u" (in addition to the existing -x) to the anchor
scripts. This should fix an issue seen occasionally in the haproxy
chart that is only explainable by the IDENTIFIER variable not being
set correctly.
All variables used in the anchor scripts ought to be defined, and
there's no need to rely on blank strings as defaults.
"set -e" was considered for this, but may have unintended side-effects:
-u should be safe and avoid the issue we've seen.
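What "set -u" buys can be shown in a few lines; the variable names here
are throwaway examples. A reference to an unset variable aborts the
script instead of silently expanding to an empty string:

```shell
#!/bin/sh
set -u

IDENTIFIER="node-1"
echo "identifier is ${IDENTIFIER}"

# Referencing an undefined variable now fails loudly; run it in a
# subshell so this script can observe the failure instead of dying.
if ( echo "${UNSET_VARIABLE}" ) 2>/dev/null; then
    echo "unset reference succeeded"
else
    echo "unset reference failed as expected"
fi
```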
Change-Id: Idbc2f9f77d4754874999d5d83d322a17076c7392
- Rewrite some anchor scripting to support dash
- The 'function' keyword is not supported; refactor to POSIX function
declarations
- Rewrite the aux monitor to support dash
- Same
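The refactor boils down to one declaration form. The bash-only
`function name { ... }` is rejected by dash, while the POSIX form below
runs in dash, ash, busybox sh, and bash alike; the function name is a
hypothetical example, not one of the actual anchor functions.

```shell
#!/bin/sh
# POSIX form: name() { ... }  -- portable; the bash-only
# "function sync_configuration { ... }" form is a syntax error in dash.
sync_configuration() {
    echo "syncing configuration for $1"
}

sync_configuration etcd
```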
Change-Id: If44c59be2f30fd30c1a668bc27e58b37575610b5
- If an etcd member has corrupted data or has somehow
been removed from a cluster, the anchor does not currently
recover. This change adds a threshold of X monitoring loops
after which the anchor will remove the member from the cluster
and recreate it.
Note: This is safe due to etcd's strict quorum checking on
runtime reconfiguration, see [0].
[0] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--strict-reconfig-check
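The threshold logic can be sketched as below. This is an illustration
only: etcdctl and the health check are stubbed, and the threshold value
and member name are hypothetical, not the real anchor defaults.

```shell
#!/bin/sh
FAILURE_THRESHOLD=3
failures=0

# Stubs: the member reports unhealthy on every check, and etcdctl just
# echoes the removal it would perform.
member_is_healthy() { return 1; }
etcdctl() { echo "member $3 removed"; }

while :; do
    if member_is_healthy; then
        failures=0                    # healthy check resets the counter
    else
        failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
        # Runtime reconfiguration is quorum-checked by etcd itself;
        # after removal the anchor re-creates the member so it rejoins
        # with fresh data.
        etcdctl member remove bad-member
        break
    fi
done
echo "removed after $failures failed checks"
```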
Change-Id: Id2ceea7393c46bed9fa5e3ead37014e52c91eac3
* added the missing recursive flag to the chmod command used to remove
extraneous permissions from CURATED_DIRS
* added commands to change permissions for manifests and configurations
that are copied to the host
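The effect of the missing flag is easy to demonstrate: without -R only
the directory itself is changed, not the files under it. The directory
layout below is a throwaway example, not the real CURATED_DIRS.

```shell
#!/bin/sh
DIR=$(mktemp -d)
mkdir -p "$DIR/sub"
touch "$DIR/sub/manifest.yaml"
chmod 777 "$DIR/sub/manifest.yaml"   # simulate overly broad permissions

chmod -R go-rwx "$DIR"   # -R so nested files lose the extra permissions too

perms=$(ls -l "$DIR/sub/manifest.yaml" | cut -c1-10)
echo "$perms"
```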
Change-Id: I174db09061c3162db11dd976a55132f5fad7a80d
This adds a sleep to avoid a tight restart loop for etcd when running in
bootstrap mode (e.g. to spin up etcd for calico).
This doesn't seem to have manifested before, but I saw it while
troubleshooting an environment yesterday, and I'm surprised it hasn't
been seen before.
The issue manifests as repeated teardown and replacement of the
bootstrapping <svc>-etcd-<hostname> pod put in place by the anchor. The
log messages in the etcd container of the pod will say that etcd is
terminating because it got SIGTERM, and a large number of pause
containers will be left behind and visible in `docker ps -a`. The
constant pod replacement was racing with how quickly kubernetes would
see the healthy (non-anchor) etcd pod allowing the anchor to be able to
reach etcd over the kubernetes service to check its health. A successful
health check by the anchor ends the bootstrapping phase, exiting the
race.
I'm confident there's a better approach to clean this section of code
up; however, the concern with this PS is to address the problematic
tight loop, allowing a more rigorous improvement to come later.
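The shape of the fix is a sleep inside the wait loop. In this sketch the
health check is stubbed to succeed on the third pass; the function name
is hypothetical, not the actual anchor code.

```shell
#!/bin/sh
attempts=0
etcd_is_healthy() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]   # stub: healthy on the third check
}

until etcd_is_healthy; do
    # Without this sleep the loop spins, tearing down and re-creating
    # the bootstrap pod faster than kubernetes can report it healthy.
    sleep 1
done
echo "bootstrap complete after $attempts checks"
```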
Change-Id: I0e3181194cfcd376967672b47a5e126103b4dfe4
- During genesis there was a race condition between the genesis node
leaving and other nodes joining.
- Updated etcd anchor to update the config when a host is not healthy.
fixes #54
Change-Id: I0ba2c831c73cc3136ee635e7d0c0efcc8b009858
This change includes several interconnected features:
* Migration to Deckhand-based configuration. This is integrated here,
because new configuration data were needed, so it would have been
wasted effort to either implement it in the old format or to update
the old configuration data to Deckhand format.
* Failing faster with stronger validation. Migration to Deckhand
configuration was a good opportunity to add schema validation, which
is a requirement in the near term anyway. Additionally, rendering
all templates up front adds an additional layer of "fail-fast".
* Separation of certificate generation and configuration assembly into
different commands. Combined with Deckhand substitution, this creates
a much clearer distinction between Promenade configuration and
deployable secrets.
* Migration of components to charts. This is a key step that will
enable support for dynamic node management. Additionally, this paves
the way for significant configurability in component deployment.
* The kubelet version is configurable and controlled via download URL.
* Restructuring templates to be more intuitive. Many of the templates
require changes or deletion due to the migration to charts.
* Installation of pre-configured useful tools on hosts, including calicoctl.
* DNS is now provided by coredns, which is highly configurable.
Change-Id: I9f2d8da6346f4308be5083a54764ce6035a2e10c