* remove the healthcheck sidecar and perform probes in the etcd
container itself; liveness probes that fail in a sidecar do not
restart the problematic etcd container;
* verify that the etcdctl member list command in the anchor is
always successful;
* point the ETCDCTL_ENDPOINTS env in the etcd container at the
POD_IP variable instead of localhost (127.0.0.1);
* add liveness/readiness probes to the auxiliary etcd as well,
and properly pass etcd configuration variables as strings;
* monitor the current leader in the initial etcd cluster; if an
aux member is the current leader, transfer leadership to a permanent
member; the same check applies to the aux suicide process;
* keep the etcd aux pod alive until all permanent nodes have come
up and joined the cluster and the apiserver no longer relies on
aux members;
* add a 5-second sleep between aux member removals for a smoother
transition process.
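The aux-leader check above can be sketched as follows. This is a minimal
illustration, not the actual anchor code: etcdctl is stubbed out so the
control flow is self-contained, and the member IDs and names (aux-0,
perm-0) are hypothetical.

```shell
#!/bin/sh
# Stub: pretend "aux-0" (id 1111) is the current leader and "perm-0"
# (id 2222) is a permanent member.
etcdctl() {
    case "$1" in
        endpoint)    echo '1111' ;;   # id of the current leader
        member)      printf '1111, started, aux-0\n2222, started, perm-0\n' ;;
        move-leader) echo "leadership transferred to $2" ;;
    esac
}

LEADER_ID=$(etcdctl endpoint status)
LEADER_NAME=$(etcdctl member list | awk -F', ' -v id="$LEADER_ID" '$1 == id { print $3 }')

case "$LEADER_NAME" in
    aux-*)
        # The leader is an auxiliary member: hand leadership to a
        # permanent member before removing the aux member.
        PERM_ID=$(etcdctl member list | awk -F', ' '$3 !~ /^aux-/ { print $1; exit }')
        etcdctl move-leader "$PERM_ID"
        ;;
esac
```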
Signed-off-by: Ruslan Aliev <raliev@mirantis.com>
Change-Id: I7918072a6ba5a6b22b359d1616def8c31425462d
Sometimes the calico-etcd pod crashloops when it is being bootstrapped.
This occurs intermittently in the gates.
Best guess: when the etcd-anchor pod initially creates the etcd static
manifest, it waits for the anchor period (15 seconds) for the etcd pod
to become ready. If it is not ready, the next iteration through the loop
recreates an identical manifest. The fact that it is a new file causes
kubelet to terminate the original container and start up a new one.
Kubelet and the container runtime get out of sync, and kubelet can't
figure out the correct container id, so the pod ends up crashlooping
forever. Manually removing and re-adding the manifest file doesn't
resolve the condition, although a kubelet restart actually does.
This "fix" will only write the updated manifest if it is different, and
hopefully will prevent the condition from occurring.
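The guard can be sketched as: render into a temp file, then only move it
over the file kubelet is watching when the content actually differs. The
paths and function name below are hypothetical stand-ins, not the real
anchor script.

```shell
#!/bin/sh
MANIFEST=$(mktemp -d)/etcd.yaml

write_manifest() {
    # Render into a temp file first and compare before replacing, so an
    # identical manifest never touches the file kubelet is watching.
    tmp=$(mktemp)
    printf '%s\n' "$1" > "$tmp"
    if ! cmp -s "$tmp" "$MANIFEST" 2>/dev/null; then
        mv "$tmp" "$MANIFEST"
        echo "manifest updated"
    else
        rm -f "$tmp"
        echo "manifest unchanged"
    fi
}

write_manifest "apiVersion: v1"   # first write: file is created
write_manifest "apiVersion: v1"   # identical content: file left alone
```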
Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
This adds "set -u" (in addition to the existing -x) to the anchor
scripts. This should fix an issue seen occasionally in the haproxy
chart that is only explainable by the IDENTIFIER variable not being
set correctly.
All variables used in the anchor scripts ought to be defined, and
there's no need to rely on blank strings as defaults.
"set -e" was considered for this, but may have unintended side-effects:
-u should be safe and avoid the issue we've seen.
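What "set -u" buys can be shown in a few lines; the variable names here
are throwaway examples. A reference to an unset variable aborts the
script instead of silently expanding to an empty string:

```shell
#!/bin/sh
set -u

IDENTIFIER="node-1"
echo "identifier is ${IDENTIFIER}"

# Referencing an undefined variable now fails loudly; run it in a
# subshell so this script can observe the failure instead of dying.
if ( echo "${UNSET_VARIABLE}" ) 2>/dev/null; then
    echo "unset reference succeeded"
else
    echo "unset reference failed as expected"
fi
```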
Change-Id: Idbc2f9f77d4754874999d5d83d322a17076c7392
- Rewrite some anchor scripting to support dash
- The 'function' keyword is not supported; refactor to POSIX function
declarations
- Rewrite the aux monitor to support dash
- Same
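The refactor boils down to one declaration form. The bash-only
`function name { ... }` is rejected by dash, while the POSIX form below
runs in dash, ash, busybox sh, and bash alike; the function name is a
hypothetical example, not one of the actual anchor functions.

```shell
#!/bin/sh
# POSIX form: name() { ... }  -- portable; the bash-only
# "function sync_configuration { ... }" form is a syntax error in dash.
sync_configuration() {
    echo "syncing configuration for $1"
}

sync_configuration etcd
```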
Change-Id: If44c59be2f30fd30c1a668bc27e58b37575610b5
- If an etcd member has corrupted data or has somehow
been removed from a cluster, the anchor does not currently
recover. This change adds a threshold of X monitoring loops
after which the anchor will remove the member from the cluster
and recreate it.
Note: This is safe due to etcd's strict quorum checking on
runtime reconfiguration, see [0].
[0] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--strict-reconfig-check
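The threshold logic can be sketched as below. This is an illustration
only: etcdctl and the health check are stubbed, and the threshold value
and member name are hypothetical, not the real anchor defaults.

```shell
#!/bin/sh
FAILURE_THRESHOLD=3
failures=0

# Stubs: the member reports unhealthy on every check, and etcdctl just
# echoes the removal it would perform.
member_is_healthy() { return 1; }
etcdctl() { echo "member $3 removed"; }

while :; do
    if member_is_healthy; then
        failures=0                    # healthy check resets the counter
    else
        failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
        # Runtime reconfiguration is quorum-checked by etcd itself;
        # after removal the anchor re-creates the member so it rejoins
        # with fresh data.
        etcdctl member remove bad-member
        break
    fi
done
echo "removed after $failures failed checks"
```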
Change-Id: Id2ceea7393c46bed9fa5e3ead37014e52c91eac3
* added the missing recursive flag to the chmod command used to remove
extraneous permissions from CURATED_DIRS
* added commands to change permissions for manifests and configurations
that are copied to the host
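The effect of the missing flag is easy to demonstrate: without -R only
the directory itself is changed, not the files under it. The directory
layout below is a throwaway example, not the real CURATED_DIRS.

```shell
#!/bin/sh
DIR=$(mktemp -d)
mkdir -p "$DIR/sub"
touch "$DIR/sub/manifest.yaml"
chmod 777 "$DIR/sub/manifest.yaml"   # simulate overly broad permissions

chmod -R go-rwx "$DIR"   # -R so nested files lose the extra permissions too

perms=$(ls -l "$DIR/sub/manifest.yaml" | cut -c1-10)
echo "$perms"
```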
Change-Id: I174db09061c3162db11dd976a55132f5fad7a80d
This adds a sleep to avoid a tight restart loop for etcd when running in
bootstrap mode (e.g. to spin up etcd for calico).
This doesn't seem to have manifested before, but I saw it while
troubleshooting an environment yesterday, and I'm surprised it hasn't
been seen before.
The issue manifests as repeated teardown and replacement of the
bootstrapping <svc>-etcd-<hostname> pod put in place by the anchor. The
log messages in the etcd container of the pod will say that etcd is
terminating because it got SIGTERM, and a large number of pause
containers will be left behind and visible in `docker ps -a`. The
constant pod replacement was racing with how quickly kubernetes would
see the healthy (non-anchor) etcd pod allowing the anchor to be able to
reach etcd over the kubernetes service to check its health. A successful
health check by the anchor ends the bootstrapping phase, exiting the
race.
I'm confident there's a better approach to clean this section of code
up; however, the concern with this PS is to address the problematic
tight loop, allowing a more rigorous improvement to come later.
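The shape of the fix is a sleep inside the wait loop. In this sketch the
health check is stubbed to succeed on the third pass; the function name
is hypothetical, not the actual anchor code.

```shell
#!/bin/sh
attempts=0
etcd_is_healthy() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]   # stub: healthy on the third check
}

until etcd_is_healthy; do
    # Without this sleep the loop spins, tearing down and re-creating
    # the bootstrap pod faster than kubernetes can report it healthy.
    sleep 1
done
echo "bootstrap complete after $attempts checks"
```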
Change-Id: I0e3181194cfcd376967672b47a5e126103b4dfe4
- During genesis there was a race condition between the genesis node
leaving and other nodes joining.
- Updated etcd anchor to update the config when a host is not healthy.
fixes #54
Change-Id: I0ba2c831c73cc3136ee635e7d0c0efcc8b009858
This change includes several interconnected features:
* Migration to Deckhand-based configuration. This is integrated here,
because new configuration data were needed, so it would have been
wasted effort to either implement it in the old format or to update
the old configuration data to Deckhand format.
* Failing faster with stronger validation. Migration to Deckhand
configuration was a good opportunity to add schema validation, which
is a requirement in the near term anyway. Additionally, rendering
all templates up front adds an additional layer of "fail-fast".
* Separation of certificate generation and configuration assembly into
different commands. Combined with Deckhand substitution, this creates
a much clearer distinction between Promenade configuration and
deployable secrets.
* Migration of components to charts. This is a key step that will
enable support for dynamic node management. Additionally, this paves
the way for significant configurability in component deployment.
* The kubelet version is configurable and controlled via download URL.
* Restructuring templates to be more intuitive. Many of the templates
require changes or deletion due to the migration to charts.
* Installation of pre-configured useful tools on hosts, including calicoctl.
* DNS is now provided by coredns, which is highly configurable.
Change-Id: I9f2d8da6346f4308be5083a54764ce6035a2e10c