Update K8s Preflight Check Operator

It seems that pods can go into 'MatchNodeSelector' status after a hard
reboot of the node for the reasons described in [0]. The current
preflight check operator fails the health checks even when the entire
cluster returns to normal after the hard reboot, because it flags any
pod that is not in the 'Succeeded' or 'Running' state. As a result,
our workflow stops and goes into the failed state.
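
For illustration, the following is a minimal sketch (not part of this
change) of the strict check that trips on such pods. It uses the
kubernetes Python client; the standalone script form and the local
kubeconfig loading are assumptions for running it outside the cluster,
and it raises a plain RuntimeError where the operator raises an
AirflowException.

    # Minimal sketch of the strict "Succeeded/Running only" check.
    # Assumption: run outside the cluster with a local kubeconfig.
    import logging

    from kubernetes import client, config

    logging.basicConfig(level=logging.INFO)

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for i in v1.list_pod_for_all_namespaces(watch=False).items:
        logging.info("%s is in %s state", i.metadata.name, i.status.phase)
        if i.status.phase not in ['Succeeded', 'Running']:
            # A leftover 'Failed'/'MatchNodeSelector' pod lands here even
            # though the cluster has already recovered from the reboot.
            raise RuntimeError("Kubernetes Health Checks Failed!")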

This patch set handles this scenario and relaxes the health check
requirements for the k8s cluster by logging the information of such
pods instead of failing the workflow (note that the status of such
pods will resemble [1]).

[0] https://github.com/kubernetes/kubernetes/issues/52902

[1]

 'status': {'conditions': None,
            'container_statuses': None,
            'host_ip': None,
            'init_container_statuses': None,
            'message': 'Pod Predicate MatchNodeSelector failed',
            'phase': 'Failed',
            'pod_ip': None,
            'qos_class': None,
            'reason': 'MatchNodeSelector',
            'start_time': datetime.datetime(2018, 3, 30, 15, 49, 39, tzinfo=tzlocal())}}
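
As a rough, hypothetical illustration of the relaxed rule described
above (the helper name and the fake status object below are not part
of this change), only failed pods matching the signature shown in [1]
are tolerated; everything else still fails the health check:

    # Hypothetical helper mirroring the relaxed rule: tolerate only the
    # benign post-reboot 'MatchNodeSelector' leftovers.
    from types import SimpleNamespace


    def is_match_node_selector_leftover(status):
        """Return True for the 'MatchNodeSelector' signature shown in [1]."""
        return (status.phase == 'Failed' and
                status.container_statuses is None and
                status.reason == 'MatchNodeSelector')


    # Fake status object with the fields from [1], for demonstration only.
    leftover = SimpleNamespace(phase='Failed',
                               container_statuses=None,
                               reason='MatchNodeSelector')
    assert is_match_node_selector_leftover(leftover)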

Change-Id: Idb1208d93cddc01cd0375a5ac2e6e73dd3dfad61

Anthony Lin 2018-03-30 17:35:33 +00:00 committed by Bryan Strassner
parent e178005143
commit 017faba69f
1 changed file with 24 additions and 2 deletions

@@ -1,4 +1,4 @@
-# Copyright 2017 AT&T Intellectual Property. All other rights reserved.
+# Copyright 2018 AT&T Intellectual Property. All other rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -49,7 +49,29 @@ class K8sHealthCheckOperator(BaseOperator):
                          i.status.phase)
             if i.status.phase not in ['Succeeded', 'Running']:
-                raise AirflowException("Kubernetes Health Checks Failed!")
+                # NOTE: Kubelet receives information about the pods
+                # and node from etcd after a restart. It seems that
+                # it is possible for kubelet to set the pod status to
+                # 'MatchNodeSelector' after a hard reboot of the node.
+                # This can happen if the labels in the initial node
+                # info differ from the node info in etcd, which will
+                # in turn cause the pod admission to fail.
+                #
+                # As the system does recover after a hard reboot with
+                # new pods created for the various services, we need
+                # to ignore the failed pods with 'MatchNodeSelector'
+                # status to avoid false alarms. Hence, in this
+                # situation we log warning messages with the current
+                # state of these pods instead of failing the health
+                # checks.
+                if (i.status.phase == 'Failed' and
+                        i.status.container_statuses is None and
+                        i.status.reason == 'MatchNodeSelector'):
+                    logging.warning("%s is in %s state with status",
+                                    i.metadata.name, i.status.phase)
+                    logging.warning(i.status)
+                else:
+                    raise AirflowException("Kubernetes Health Checks Failed!")


 class K8sHealthCheckPlugin(AirflowPlugin):