From 5c84aec587b76a54c300dd5feeeab7b083d41e36 Mon Sep 17 00:00:00 2001
From: Evgeny L
Date: Thu, 18 Apr 2019 20:12:50 +0000
Subject: [PATCH] Initial implementation of Troubleshooting Guide

Add an initial implementation of Airship Troubleshooting Guide
that users can use when they encounter problems with their
Airship installation.

Change-Id: I9c5546cbc5f12db81cc3fcc6a3be95e8dd6f52fe
---
 doc/source/index.rst                 |   1 +
 doc/source/troubleshooting_guide.rst | 177 +++++++++++++++++++++++++++
 2 files changed, 178 insertions(+)
 create mode 100644 doc/source/troubleshooting_guide.rst

diff --git a/doc/source/index.rst b/doc/source/index.rst
index 63c1107ed..350bf3f45 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -193,6 +193,7 @@ Process Flows
    :maxdepth: 2
 
    authoring_and_deployment
+   troubleshooting_guide
    seaworthy
    airskiff
    airsloop
diff --git a/doc/source/troubleshooting_guide.rst b/doc/source/troubleshooting_guide.rst
new file mode 100644
index 000000000..b4e9475fd
--- /dev/null
+++ b/doc/source/troubleshooting_guide.rst
@@ -0,0 +1,177 @@
+Troubleshooting Guide
+=====================
+
+This guide provides information on troubleshooting an Airship
+environment. Debugging any software component starts with gathering
+more information about the failure, so the intention of this document
+is not to describe specific issues that one can encounter, but to provide
+a generic set of instructions that a user can follow to find the
+root cause of the problem.
+
+For additional support you can contact the Airship team via
+`IRC or mailing list `__,
+or use the `Airship bug tracker `__
+to search and create issues.
+
+Configuring Airship CLI
+-----------------------
+
+Many commands in this guide use the Airship CLI; this section describes
+how to configure it in your environment.
+
+::
+
+    git clone https://opendev.org/airship/treasuremap
+    cd treasuremap/
+    # List available tags.
+    git tag --list
+    # Switch to the version of your site.
+    git checkout {your-tag}
+    # Go back to the parent directory.
+    cd ..
+    # Run it without arguments to get a help message.
+    sudo ./treasuremap/tools/airship
+
+Manifests Preparation
+---------------------
+
+When you make configuration changes to the manifests, there are a few
+commands that you can use to validate the changes without uploading them
+to the Airship environment.
+
+Run the ``lint`` command for your site; it helps catch errors such as
+duplicated documents and broken references.
+
+Example:
+
+::
+
+    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
+        lint {site-name}
+
+If you create configuration overrides or change substitutions,
+it is recommended to run the ``render`` command. It merges the layers
+and renders all substitutions, which shows what parameters are
+passed to Helm as overrides for the Charts' defaults.
+
+Example:
+
+::
+
+    # Save the result into the rendered.txt file.
+    sudo ./treasuremap/tools/airship pegleg site -r treasuremap/ \
+        render -o rendered.txt {site-name}
+
+Deployment Failure
+------------------
+
+During the deployment, it is important to identify the specific step
+where it fails. There are two major deployment steps:
+
+1. **Drydock build**: deploys the Operating System.
+2. **Armada build**: deploys Helm Charts.
+
+After `Configuring Airship CLI`_, set up credentials for accessing
+Shipyard. The password is stored in the ``ucp_shipyard_keystone_password``
+secret, which you can find in the
+``site/airship-seaworthy/secrets/passphrases/ucp_shipyard_keystone_password.yaml``
+configuration file of your site.
+
+::
+
+    export OS_USERNAME=shipyard
+    export OS_PASSWORD={shipyard_password}
+
+Now you can use the following commands to access Shipyard:
+
+::
+
+    # Get all actions that were executed on your environment.
+    sudo ./treasuremap/tools/airship shipyard get actions
+    # Show all the steps within the action.
+    sudo ./treasuremap/tools/airship shipyard describe action/{action_id}
+    # Get more details on the step.
+    sudo ./treasuremap/tools/airship shipyard describe step/{action_id}/armada_build
+    # Print the logs from the step.
+    sudo ./treasuremap/tools/airship shipyard logs step/{action_id}/armada_build
+
+
+After the failed step is determined, you can access the logs of a specific
+service (e.g., drydock-api/maas or armada-api) to get more information
+on the failure. Note that there may be multiple pods of a single service
+running, so you need to check all of them to find where the most recent
+logs are available.
+
+Example of accessing Armada API logs:
+
+::
+
+    # Get all pods running on the cluster and find the name of the pod you are
+    # interested in.
+    kubectl get pods -o wide --all-namespaces
+
+    # See the logs of a specific pod.
+    kubectl logs -n ucp -f --tail 200 armada-api-d5f757d5-6z6nv
+
+In some cases you may want to restart a pod. There is no dedicated command for
+that in Kubernetes; however, you can delete the pod, and it will be recreated
+by Kubernetes to satisfy the replication factor.
+
+::
+
+    # Restart the Armada API service.
+    kubectl delete pod -n ucp armada-api-d5f757d5-6z6nv
+
+Ceph
+----
+
+Many stateful services in Airship rely on Ceph to function correctly.
+For more information on Ceph debugging, see the official
+`Ceph debugging guide `__.
+
+Although Ceph tolerates failures of multiple OSDs, it is important
+to make sure that your Ceph cluster is healthy.
+
+Example:
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')
+    # Get the status of the Ceph cluster.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph -s
+
+The cluster is in a healthy state when the ``health`` parameter is set to ``HEALTH_OK``.
+
+When the cluster is unhealthy and some Placement Groups are reported to be in
+degraded or down states, determine the problem by inspecting the logs of
+the Ceph OSD that is down using ``kubectl``.
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')
+    # List the hierarchy of OSDs in the cluster to see which OSDs are down.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph osd tree
+
+There are a few other commands that may be useful during debugging:
+
+::
+
+    # Get the name of a Ceph Monitor pod.
+    CEPH_MON=$(sudo kubectl get pods --all-namespaces -o=name | \
+        grep ceph-mon | sed -n 1p | sed 's|pod/||')
+
+    # Get detailed information on the status of every Placement Group.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- ceph pg dump
+
+    # List allocated block devices.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd ls
+    # See which client uses the device.
+    sudo kubectl exec -it -n ceph ${CEPH_MON} -- rbd status \
+        kubernetes-dynamic-pvc-e71e65a9-3b99-11e9-bf31-e65b6238af01
+
+    # List all Ceph block devices mounted on a specific host.
+    mount | grep rbd
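
The health check described in the Ceph section of the guide could be wrapped in a small shell function so that it can run from a script. This is only a minimal sketch: the function name `ceph_health_ok` is hypothetical and not part of Airship or Ceph, and it assumes the `ceph -s` output contains a `health` field followed by a status such as `HEALTH_OK`, as the guide states.

```shell
#!/bin/sh
# Hypothetical helper (not part of Airship or Ceph): reads `ceph -s` output
# on stdin and succeeds only when the health field reports HEALTH_OK.
# Assumption: the status is printed as "health: HEALTH_OK" or
# "health HEALTH_OK"; adjust the pattern if your Ceph release differs.
ceph_health_ok() {
    grep -Eq 'health:?[[:space:]]+HEALTH_OK'
}
```

With `CEPH_MON` set as in the guide, a possible invocation is `sudo kubectl exec -n ceph "${CEPH_MON}" -- ceph -s | ceph_health_ok || echo "Ceph cluster is not healthy"`. The `-it` flags are dropped here because a piped command does not need a terminal.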