After v3.5.6, etcd-io switched to a
distroless base image. The etcd anchor pods
now use etcd-utility, and etcd
runs a sidecar for health checks.
Change-Id: I198dca1209097de4d60a53a7568f0c4790679599
This PS adds the ability to limit (throttle) the number of
simultaneous backup uploads while keeping the logic on the client
side, using flag files on the remote side.
Change-Id: I753faab8f3d934346d54e38bfc94cec3a8f79385
This PS adds the option of staggered backups by adding anti-affinity
rules to the backup cronjobs. The rules can be followed across several
namespaces to decrease the load on the remote backup destination
server, making sure that only one backup upload is in progress at any
moment.
Change-Id: I320c6ce6370b45c602114189819a4225e479f680
This PS updates python modules and code to match Airflow 2.6.2:
- bionic py36 gates were removed
- python code corrected to match new modules versions
- selection of python module versions was performed based on
airflow-2.6.2 constraints
Change-Id: I9c3e139b3437414a61af7e7c0b7d7e533fadefda
Update the etcd chart with an empty implementation of the backup
validation function (to be implemented in the future). This is needed
because the helm-toolkit chart in openstack-helm-infra now calls
verify_databases_backup_archives() as part of the backup_databases()
implementation:
https://review.opendev.org/c/openstack/openstack-helm-infra/+/853027
Changed the apiVersion of the etcd cronjob from batch/v1beta1 to
batch/v1 and fixed the securityContext for etcd_backup.
Also bumped the HTK version to 0.2.48, using the commit id obtained
from the merge of
https://review.opendev.org/c/openstack/openstack-helm-infra/+/853027
and set that commit id in tools/helm_tk.sh.
Change-Id: Ie047dd0e6a2aae6483ace89cad22d6720890cdfc
Address changes and deprecations in Kubernetes v1.21=>v1.23
controller-manager:
* --authorization-kubeconfig and --authentication-kubeconfig must be set
* liveness/readiness probes must use HTTPS
* the default port has been changed to 10257
kubelet:
* --dynamic-config-dir has been deprecated, will not move to GA
* --cni-bin-dir has been deprecated, will be removed with dockershim
* --cni-conf-dir has been deprecated, will be removed with dockershim
* --network-plugin has been deprecated, will be removed with dockershim
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.23.md#deprecation
https://kubernetes.io/docs/tasks/administer-cluster/reconfigure-kubelet/
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/281-dynamic-kubelet-configuration
Change-Id: Ia996d7c14d81d1d8b8067f11c02ffb4ce90eb49a
Removing set -x from within the dump_databases_to_directory function.
The set -x inside the function turns on debug tracing for all the code
that follows the function call. This in turn produces multiple
identical logs for the same event. The function itself should have
enough logging to aid debugging.
Reference ps: https://review.opendev.org/c/openstack/openstack-helm-infra/+/830533
(commit 2fc1ce4a142e605a9fc6c90dceabbf7c4bfb81e3)
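The tracing leak described above can be reproduced with a minimal
sketch. The two function names are hypothetical stand-ins and are not
the helm-toolkit code; the point is only that set -x changes a
shell-wide option, so it stays on after the function returns.

```shell
# Hypothetical stand-in for a function that enables tracing internally.
dump_with_trace() {
  set -x    # shell-wide option: it is not scoped to this function
  :         # placeholder for the real dump work
}

# The same placeholder work, without touching shell options.
dump_without_trace() {
  :
}

dump_without_trace
case "$-" in *x*) traced_before=yes ;; *) traced_before=no ;; esac

dump_with_trace
case "$-" in *x*) traced_after=yes ;; *) traced_after=no ;; esac
set +x      # switch tracing back off for the rest of this sketch

echo "tracing on after dump_without_trace: $traced_before"
echo "tracing on after dump_with_trace: $traced_after"
```

The check reads the option flags in $-, which contain "x" while tracing
is enabled.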
Change-Id: Id442972bbcca983afab7c4f3c29f3686e9e0b481
Pick up the helm-toolkit DB backup enhancement in etcd
to add the capability to retry uploading a backup to the remote server.
Change-Id: If6ea347a4c2c55f14f35d95681aaf482d0a6103c
1) Uplift helm-toolkit to include db-backup-restore error log string
prefixes for alert generation
https://review.opendev.org/c/openstack/openstack-helm-infra/+/823867
2) Error log string prefixes are added to etcd backup-restore as well
Change-Id: Iad51a3e55567d0861140a97c17a1b7d859e13938
Update applicable charts to use non-deprecated APIs [0], specifically
addressing the following resource types:
* ClusterRole
* ClusterRoleBinding
* Role
* RoleBinding
The APIs being migrated to are available in v1.19 or earlier. As of this
change, v1.19 is the oldest supported Kubernetes version, slated for EOL
on 2021-10-28. [1]
0: https://kubernetes.io/docs/reference/using-api/deprecation-guide/
1: https://kubernetes.io/releases/
Change-Id: I134b201d9ae01a8d74e34ee14f3bfe3b960cb5aa
* Give kube-proxy a blanket toleration
* Replace scheduler.alpha.kubernetes.io/critical-pod annotation with
priorityClassName: system-node-critical
Change-Id: I810333913c09531eefa1ded014fe090d4cca7f7d
Sometimes the calico-etcd pod crashloops when it is being bootstrapped.
This occurs intermittently in the gates.
Best guess: when the etcd-anchor pod initially creates the etcd static
manifest, it waits for the anchor period (15 seconds) for the etcd pod
to become ready. If it is not ready, the next iteration through the loop
recreates an identical manifest. The fact that it is a new file causes
kubelet to terminate the original container and start up a new one.
Kubelet and the container runtime get out of sync, and kubelet can't
figure out the correct container id, so the pod ends up crashlooping
forever. Manually removing and readding the manifest file doesn't
resolve the condition, although a kubelet restart actually does.
This "fix" will only write the updated manifest if it is different, and
hopefully will prevent the condition from occurring.
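A minimal sketch of the "write only if different" idea, assuming the
rendered manifest is compared byte-for-byte with the existing file.
The function name write_if_changed and the file names are hypothetical
and are not taken from the anchor script.

```shell
# write_if_changed SRC DEST: copy SRC over DEST only when the content differs.
write_if_changed() {
  src=$1
  dest=$2
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    echo unchanged    # identical content: leave the existing file untouched
  else
    cp "$src" "$dest"
    echo written
  fi
}

workdir=$(mktemp -d)
printf 'kind: Pod\n' > "$workdir/rendered.yaml"

first=$(write_if_changed "$workdir/rendered.yaml" "$workdir/etcd.yaml")
second=$(write_if_changed "$workdir/rendered.yaml" "$workdir/etcd.yaml")

printf 'kind: Pod\n# changed\n' > "$workdir/rendered.yaml"
third=$(write_if_changed "$workdir/rendered.yaml" "$workdir/etcd.yaml")

echo "$first $second $third"
rm -rf "$workdir"
```

Only the first and third calls touch the destination file, so a loop
iteration that renders an identical manifest leaves the file as it was.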
Change-Id: I4b6b1bf17fd8f0b36d24a741779505b38dba349f
Since the chart version check was introduced in the gates, requirements
are no longer satisfied by the strict check for 0.1.0
Change-Id: Ifd2d7af1f2dabe9bbccd65551e0223dddff529dc
The kubernetes-etcd pods are leaving behind zombie processes and
setting 'shareProcessNamespace: true' eliminates that problem.
When you enable process namespace sharing for a Pod, Kubernetes uses a
single process namespace for all the containers in that Pod. The
Kubernetes Pod infrastructure container becomes PID 1 and automatically
reaps orphaned processes. [0]
[0] https://cloud.google.com/solutions/best-practices-for-building-containers#solution_2_enable_process_namespace_sharing_in_kubernetes
Change-Id: I61566fb71258baafa709b0e5367c71f13e980f6f
Include fix [0] for the return code when the remote RGW fails.
Moving set -x in backup/restore.tpl below the source
of the framework code to reduce debug output.
[0] https://review.opendev.org/#/c/738665/
Change-Id: If9b7b317dff439ecb293d9837cac256884c53c6a
1) Include framework for remote etcd backups.
2) Use porthole etcdctl utility image for backups.
3) Move helm-toolkit pin to latest commit.
4) Add a keystone user for RGW.
5) Add a secret for Swift API access.
6) Add a secret for backup/restore configuration.
Change-Id: Ica549c3b6bc00ca55540b8ffedd4c46af0d8d25e
This updates the coredns, haproxy and etcd charts to include the pod
security context on the pod template.
This also adds the container security context to set the
readOnlyRootFilesystem flag.
Change-Id: I9b5b0ea83acd4c5656577d8cbc684a5031ca0111
Adds configmap-hash annotations to the etcd anchor daemonset for
configmap-bin and configmap-etc.
Does not add hash annotations for configmap-certs or secret-keys, with
the thought that if certs or keys are changed, some manual intervention
might be warranted, and restarting the anchors automatically might not
be desirable.
Change-Id: I22ff8fafa5d37c10138ddaa4095174b25fc087d8
The anchor pre-stop script uses the 'function' command, which fails when
using dash. This change removes it for compatibility.
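For illustration, a sketch of the portable declaration form. The
function name and body are hypothetical; only the syntax matters here.

```shell
# POSIX form: accepted by dash, bash, and other sh implementations.
cleanup() {
  echo "cleaning up $1"
}

# bash-style form, left as a comment because dash does not recognise
# the 'function' keyword:
#   function cleanup {
#     echo "cleaning up $1"
#   }

result=$(cleanup anchor)
echo "$result"
```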
Change-Id: I6591045fa0a555800a03edbdf1f9f3a8476dd0a3
Allows extra environment variables to be applied to the etcd pods. Can
be used to apply tuning parameters, enable experimental flags, etc.
Change-Id: I9d82514b6e3a292edc472d885c0a61d5c81199f5
This adds "set -u" (in addition to the existing -x) to the anchor
scripts. This should fix an issue seen occasionally in the haproxy
chart which is only explainable by the IDENTIFIER variable failing
to get set correctly.
All variables used in the anchor scripts ought to be defined, and
there's no need to rely on blank strings as defaults.
"set -e" was considered for this, but may have unintended side-effects:
-u should be safe and avoid the issue we've seen.
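A small sketch of the behavior difference, run in child shells so the
failure does not abort the caller. IDENTIFIER is the variable named
above; the echo command is only an illustrative stand-in for the
anchor script.

```shell
unset IDENTIFIER    # make sure the variable really is undefined

# Without -u an unset variable silently expands to an empty string.
if sh -c 'echo "id=[$IDENTIFIER]"' >/dev/null 2>&1; then
  lenient=ok
else
  lenient=failed
fi

# With -u the same expansion is an error and the script exits non-zero.
if sh -c 'set -u; echo "id=[$IDENTIFIER]"' >/dev/null 2>&1; then
  strict=ok
else
  strict=failed
fi

echo "without -u: $lenient, with -u: $strict"
```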
Change-Id: Idbc2f9f77d4754874999d5d83d322a17076c7392
This adds -e to the pre_stop scripts, so that they exit with an error
if any of their commands fail. This is required, since it's the only
way to communicate whether there is an issue during pre-stop hook
execution.
"The logs for a Hook handler are not exposed in Pod events.
If a handler fails for some reason, it broadcasts an event."
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
As an example, this issue was discovered when "touch /tmp/stop"
was failing silently due to a readOnlyRootFilesystem setting,
resulting in pods that would not successfully Terminate until
the grace period was exhausted.
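The silent failure can be sketched as follows, assuming a touch that
fails because its parent directory is missing (a stand-in for the
read-only filesystem case). The paths and the echo line are
hypothetical.

```shell
workdir=$(mktemp -d)
target="$workdir/missing/stop"   # parent directory does not exist, so touch fails

# Without -e the failed touch is ignored and the script still exits 0.
if sh -c 'touch "$1"; echo "hook finished"' sh "$target" >/dev/null 2>&1; then
  without_e=success
else
  without_e=failure
fi

# With -e the script stops at the failed touch and exits non-zero.
if sh -c 'set -e; touch "$1"; echo "hook finished"' sh "$target" >/dev/null 2>&1; then
  with_e=success
else
  with_e=failure
fi

rm -rf "$workdir"
echo "without -e: $without_e, with -e: $with_e"
```

With -e the non-zero exit status is what lets the hook failure surface
at all, since the hook's own output is not exposed.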
Change-Id: Ic9a228230d944530e31ed61f4239fd434cbb6187
kubernetes-controller-manager-anchor pods get stuck in Terminating state
because the pre-stop script tries to touch /tmp/stop, which is on a
read-only root filesystem.
This change mounts an emptyDir at /tmp to resolve the issue.
The same change is applied to apiserver, etcd, and scheduler anchors, to
prevent the issue if readOnlyRootFilesystem is enabled.
Related change for haproxy:
https://review.opendev.org/685711/
Change-Id: I784498e0dc24da91a983716029973919b96a3055
- Rewrite some anchor scripting to support dash
  - 'function' not supported, refactor to POSIX function declarations
- Rewrite aux monitor to support dash
  - Same ('function' replaced with POSIX function declarations)
Change-Id: If44c59be2f30fd30c1a668bc27e58b37575610b5
This commit enables configuration of probes
for the etcd pod by overriding values in
values.yaml or through manifests.
Change-Id: I69eabd13f8ea8b97a33281ad993ec2e88b9280bc
1. Fix directory listing used to identify newest backup file to be
archived (was sometimes archiving files twice; e.g., a.tar.gz.tar.gz)
2. Fix directory listings used to identify and clean up old backups
Change-Id: Icb1ddd96613f4ab6a28c4f617001c336951568bc
- If an etcd member has corrupted data or has somehow
been removed from a cluster, the anchor does not currently
recover. This change adds a threshold of X monitoring loops
after which the anchor will remove the member from the cluster
and recreate it.
Note: This is safe due to etcd's strict quorum checking on
runtime reconfiguration, see [0].
[0] https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/configuration.md#--strict-reconfig-check
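A sketch of the threshold logic under stated assumptions: the
threshold value, the simulated health results, and the choice to reset
the counter after a healthy loop are all hypothetical, and the actual
member removal and re-creation is replaced by a counter.

```shell
THRESHOLD=3     # hypothetical value for the "X monitoring loops" threshold
failures=0
recreated=0

# Simulated health results for successive monitoring loops (1 = unhealthy).
for unhealthy in 1 1 0 1 1 1; do
  if [ "$unhealthy" -eq 1 ]; then
    failures=$((failures + 1))
  else
    failures=0                     # assumption: a healthy loop resets the count
  fi
  if [ "$failures" -ge "$THRESHOLD" ]; then
    recreated=$((recreated + 1))   # stand-in for removing and re-adding the member
    failures=0
  fi
done

echo "recreated=$recreated"
```

In this run only the final three consecutive unhealthy loops reach the
threshold, so the member is recreated once.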
Change-Id: Id2ceea7393c46bed9fa5e3ead37014e52c91eac3
This updates the etcd chart to include the pod
security context on the pod template.
This also adds the container security context to set
readOnlyRootFilesystem flag to false
Change-Id: I34a8ab3e850779192491b9b127a82b82f05fa00b