10 Commits

Author SHA1 Message Date
Ruslan Aliev 12f448963f ETCD improvements
 * remove the healthcheck sidecar and perform probes in the etcd
   container itself, since failing liveness probes in a sidecar
   do not restart a problematic etcd container;
 * verify that the etcdctl member list cmd in anchor is
   always successful;
 * adjust ETCDCTL_ENDPOINTS env in the etcd container to use the
   POD_IP variable instead of localhost (127.0.0.1);
 * add liveness/readiness probes to auxiliary etcd and
   properly pass etcd configuration variables as strings;
 * monitor the current leader in the initial etcd cluster: if an
   aux member is the current leader, pass leadership to a
   permanent member; the same check applies to the aux suicide
   process;
 * keep the etcd aux pod alive until all permanent nodes
   come up and join the cluster and the apiserver no longer
   relies on aux members;
 * add a 5 second sleep between aux member removals for a
   smoother transition.

Signed-off-by: Ruslan Aliev <raliev@mirantis.com>
Change-Id: I7918072a6ba5a6b22b359d1616def8c31425462d
2024-04-25 01:01:06 -05:00
Phil Sphicas d161528ae8 Avoid calico-etcd crashloop
Sometimes the calico-etcd pod crashloops when it is being bootstrapped.
This occurs intermittently in the gates.

Best guess: when the etcd-anchor pod initially creates the etcd static
manifest, it waits for the anchor period (15 seconds) for the etcd pod
to become ready. If it is not ready, the next iteration through the loop
recreates an identical manifest. The fact that it is a new file causes
kubelet to terminate the original container and start up a new one.

Kubelet and the container runtime get out of sync, and kubelet can't
figure out the correct container id, so the pod ends up crashlooping
forever.  Manually removing and re-adding the manifest file doesn't
resolve the condition, although a kubelet restart actually does.

This "fix" will only write the updated manifest if it is different, and
hopefully will prevent the condition from occurring.

Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
2021-02-11 07:14:49 +00:00
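
The "write only if different" idea from d161528ae8 can be sketched roughly as follows. This is a minimal illustration, not the actual anchor code: the function name sync_manifest and both path arguments are hypothetical, and the real script may compare or install the file differently.

```shell
#!/bin/sh
# Sketch only: sync_manifest and its arguments are hypothetical names,
# not taken from the actual etcd anchor script.

# Copy the rendered manifest into place only when the content differs, so
# an identical re-render never replaces the file that kubelet is watching.
sync_manifest() {
    src="$1"    # freshly rendered manifest
    dest="$2"   # static manifest path watched by kubelet
    if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
        echo "unchanged"
        return 0
    fi
    cp "$src" "$dest" && echo "updated"
}
```

Calling it twice with the same rendered file reports "updated" the first time and "unchanged" the second, so repeated anchor iterations leave the existing file alone.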
Matt McEuen 1d0a4619b4 Add -u to anchor scripts
This adds "set -u" (in addition to the existing -x) to the anchor
scripts. This should fix an issue seen occasionally in the haproxy
chart which is only explainable by the IDENTIFIER variable failing
to get set correctly.

All variables used in the anchor scripts ought to be defined, and
there's no need to rely on blank strings as defaults.

"set -e" was considered for this, but it may have unintended side effects;
-u should be safe and should avoid the issue we've seen.

Change-Id: Idbc2f9f77d4754874999d5d83d322a17076c7392
2020-02-03 14:00:12 -06:00
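
The effect of "set -u" described in 1d0a4619b4 can be illustrated with a small sketch. The helper name expand_identifier is hypothetical and exists only for this example; it expands an unset IDENTIFIER in a child shell with and without -u.

```shell
#!/bin/sh
# Sketch only: expand_identifier is a hypothetical helper for illustration.

# Expand ${IDENTIFIER} in a child shell where the variable is unset.
# Pass "-u" as $1 to enable nounset in the child, or "" to leave it off.
# The return status is the child's exit status.
expand_identifier() {
    # $1 is intentionally unquoted so that an empty value adds no argument.
    ( unset IDENTIFIER; sh $1 -c 'echo "id=${IDENTIFIER}"' ) >/dev/null 2>&1
}
```

Without -u the child silently expands the variable to an empty string and exits 0; with -u it aborts with a non-zero status, which surfaces the missing variable instead of hiding it.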
Scott Hussey 6aeab9e490 (etcd) Support dash shell
- Rewrite some anchor scripting to support dash
  - The 'function' keyword is not supported; refactor to POSIX function
    declarations
- Rewrite aux monitor to support dash
  - Same
Change-Id: If44c59be2f30fd30c1a668bc27e58b37575610b5
2019-09-01 01:22:44 -05:00
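
For readers unfamiliar with the dash restriction behind 6aeab9e490, here is a small sketch of the two declaration styles. The function name log_msg is made up for illustration and does not come from the chart.

```shell
#!/bin/sh
# Sketch only: log_msg is a hypothetical function name.

# bash-style declaration, which dash does not accept:
#   function log_msg { echo "anchor: $*"; }

# POSIX declaration, accepted by dash and bash alike:
log_msg() {
    echo "anchor: $*"
}
```

The POSIX form behaves the same under bash, so the refactor does not need separate code paths per shell.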
Hussey, Scott (sh8121) d2f020fbb7 Allow etcd anchor to recover from bad state
- If an etcd member has corrupted data or has somehow
  been removed from a cluster, the anchor does not currently
  recover. This change adds a threshold of X monitoring loops
  after which the anchor will remove the member from the cluster
  and recreate it.

Note: This is safe due to etcd's strict quorum checking on
      runtime reconfiguration, see [0].

[0] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--strict-reconfig-check

Change-Id: Id2ceea7393c46bed9fa5e3ead37014e52c91eac3
2019-06-26 07:56:59 -05:00
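
The threshold logic described in d2f020fbb7 might look something like the sketch below. All names are hypothetical, the value 3 merely stands in for "X", and counting consecutive unhealthy loops is an assumption; the commit message does not say whether the count resets on a healthy loop.

```shell
#!/bin/sh
# Sketch only: all names are hypothetical; the real anchor script is not
# shown in this log.
FAILURE_THRESHOLD=3   # stands in for the "X" in the commit message
failures=0            # unhealthy loops seen since the last healthy one
action=""             # result of the most recent record_health call

# Record the outcome of one monitoring loop and set $action to
# "none", "wait", or "recreate".
record_health() {
    if [ "$1" = "healthy" ]; then
        failures=0
        action="none"
        return 0
    fi
    failures=$((failures + 1))
    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
        failures=0
        action="recreate"   # caller would remove the member and re-add it
    else
        action="wait"
    fi
}
```

The function sets a variable rather than printing its result so that the counter survives between calls; invoking it inside a command substitution would update the counter only in a subshell.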
Michael Beaver 8b45a36419 Secure host file permissions
* added the missing recursive flag to the chmod command used to remove
  extraneous permissions from CURATED_DIRS
* added commands to change permissions for manifests and configurations
  that are copied to the host

Change-Id: I174db09061c3162db11dd976a55132f5fad7a80d
2018-10-19 13:50:18 -05:00
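
The recursive flag mentioned in 8b45a36419 matters because a plain chmod only touches the named directory. A minimal sketch, assuming the "extraneous permissions" are group and other access (the commit does not state the exact mode) and using a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: restrict_tree is a hypothetical helper, and go-rwx is an
# assumed mode; the commit message does not state the mode actually used.

# Remove group and other permissions from a directory and everything
# below it. Without -R only the top-level directory would change.
restrict_tree() {
    chmod -R go-rwx "$1"
}
```

In the chart this would presumably run once per entry in CURATED_DIRS.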
Mark Burnett dcac36c8cf Fix: Avoid etcd bootstrap race
This adds a sleep to avoid a tight restart loop for etcd when running in
bootstrap mode (e.g. to spin up etcd for calico).

This doesn't seem to have manifested before, but I saw it while
troubleshooting an environment yesterday, and I'm surprised it hasn't
been seen before.

The issue manifests as repeated teardown and replacement of the
bootstrapping <svc>-etcd-<hostname> pod put in place by the anchor.  The
log messages in the etcd container of the pod will say that etcd is
terminating because it got SIGTERM, and a large number of pause
containers will be left behind and visible in `docker ps -a`.  The
constant pod replacement was racing with how quickly Kubernetes would
see the healthy (non-anchor) etcd pod, which is what allows the anchor to
reach etcd over the Kubernetes service to check its health.  A successful
health check by the anchor ends the bootstrapping phase, exiting the
race.

I'm confident there's a better approach to clean this section of code
up; however, the concern with this PS is to address the problematic
tight loop, allowing a more rigorous improvement to come later.

Change-Id: I0e3181194cfcd376967672b47a5e126103b4dfe4
2018-09-07 07:52:44 -05:00
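
The tight loop that dcac36c8cf addresses can be pictured with the following sketch of a polling loop that pauses between attempts. The helper name wait_until and its parameters are hypothetical; the actual change simply adds a sleep to the existing bootstrap loop.

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical helper, not the anchor's code.

# Run a check command up to $1 times, sleeping $2 seconds after each
# failed attempt. Returns 0 as soon as the check succeeds, 1 otherwise.
wait_until() {
    attempts="$1"
    interval="$2"
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

The pause gives Kubernetes time to register the healthy pod before the next attempt, rather than tearing the pod down and replacing it immediately.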
Aaron Sheffield cf0037597d Fixes etcd race condition bug
- During genesis there was a race condition between the genesis node
   leaving and other nodes joining.
- Updated etcd anchor to update the config when a host is not healthy.

fixes #54

Change-Id: I0ba2c831c73cc3136ee635e7d0c0efcc8b009858
2018-03-21 20:14:00 -05:00
Anthony Lin 3b4b4661a4 Refactor etcd Chart
Refactor etcd chart to align with OSH standards

Change-Id: Ie71fcf045b3ec896dcdd03bb3455fb85af8f2e7a
2017-11-29 17:33:41 +00:00
Mark Burnett 95643147c5 Migrate to self hosted using charts
This change includes several interconnected features:

* Migration to Deckhand-based configuration.  This is integrated here,
  because new configuration data were needed, so it would have been
  wasted effort to either implement it in the old format or to update
  the old configuration data to Deckhand format.
* Failing faster with stronger validation.  Migration to Deckhand
  configuration was a good opportunity to add schema validation, which
  is a requirement in the near term anyway.  Additionally, rendering
  all templates up front adds an additional layer of "fail-fast".
* Separation of certificate generation and configuration assembly into
  different commands.  Combined with Deckhand substitution, this creates
  a much clearer distinction between Promenade configuration and
  deployable secrets.
* Migration of components to charts.  This is a key step that will
  enable support for dynamic node management.  Additionally, this paves
  the way for significant configurability in component deployment.
* Version of kubelet is configurable & controlled via download url.
* Restructuring templates to be more intuitive.  Many of the templates
  require changes or deletion due to the migration to charts.
* Installation of pre-configured useful tools on hosts, including calicoctl.
* DNS is now provided by coredns, which is highly configurable.

Change-Id: I9f2d8da6346f4308be5083a54764ce6035a2e10c
2017-10-17 13:29:46 -05:00