diff --git a/docs/source/blueprints/blueprints.rst b/docs/source/blueprints/blueprints.rst deleted file mode 100644 index a9218001..00000000 --- a/docs/source/blueprints/blueprints.rst +++ /dev/null @@ -1,28 +0,0 @@ -.. - Copyright 2018 AT&T Intellectual Property. - All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may - not use this file except in compliance with the License. You may obtain - a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, WITHOUT - WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the - License for the specific language governing permissions and limitations - under the License. - -.. _blueprints: - -Blueprints -========== - -Designs for features of the UCP. - -.. toctree:: - :maxdepth: 2 - - deployment-grouping-baremetal - node-teardown diff --git a/docs/source/blueprints/deployment-grouping-baremetal.rst b/docs/source/blueprints/deployment-grouping-baremetal.rst deleted file mode 100644 index 88a95464..00000000 --- a/docs/source/blueprints/deployment-grouping-baremetal.rst +++ /dev/null @@ -1,553 +0,0 @@ -.. - Copyright 2018 AT&T Intellectual Property. - All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may - not use this file except in compliance with the License. You may obtain - a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, WITHOUT - WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the - License for the specific language governing permissions and limitations - under the License. - -.. _deployment-grouping-baremetal: - -Deployment Grouping for Baremetal Nodes -======================================= -One of the primary functionalities of the Undercloud Platform is the deployment -of baremetal nodes as part of site deployment and upgrade. This blueprint aims -to define how deployment strategies can be applied to the workflow during these -actions. - -Overview --------- -When Shipyard is invoked for a deploy_site or update_site action, there are -three primary stages: - -1. Preparation and Validation -2. Baremetal and Network Deployment -3. Software Deployment - -During the Baremetal and Network Deployment stage, the deploy_site or -update_site workflow (and perhaps other workflows in the future) invokes -Drydock to verify the site, prepare the site, prepare the nodes, and deploy the -nodes. Each of these steps is described in the `Drydock Orchestrator Readme`_ - -.. _Drydock Orchestrator Readme: https://git.openstack.org/cgit/openstack/airship-drydock/plain/drydock_provisioner/orchestrator/readme.md - -The prepare nodes and deploy nodes steps each involve intensive and potentially -time consuming operations on the target nodes, orchestrated by Drydock and -MAAS. These steps need to be approached and managed such that grouping, -ordering, and criticality of success of nodes can be managed in support of -fault tolerant site deployments and updates. - -For the purposes of this document `phase of deployment` refer to the prepare -nodes and deploy nodes steps of the Baremetal and Network deployment. - -Some factors that advise this solution: - -1. 
Limits to the amount of parallelization that can occur due to a centralized - MAAS system. -2. Faults in the hardware, preventing operational nodes. -3. Miswiring or configuration of network hardware. -4. Incorrect site design causing a mismatch against the hardware. -5. Criticality of particular nodes to the realization of the site design. -6. Desired configurability within the framework of the UCP declarative site - design. -7. Improved visibility into the current state of node deployment. -8. A desire to begin the deployment of nodes before the finish of the - preparation of nodes -- i.e. start deploying nodes as soon as they are ready - to be deployed. Note: This design will not achieve new forms of - task parallelization within Drydock; this is recognized as a desired - functionality. - -Solution --------- -Updates supporting this solution will require changes to Shipyard for changed -workflows and Drydock for the desired node targeting, and for retrieval of -diagnostic and result information. - -Deployment Strategy Document (Shipyard) ---------------------------------------- -To accommodate the needed changes, this design introduces a new -DeploymentStrategy document into the site design to be read and utilized -by the workflows for update_site and deploy_site. - -Groups -~~~~~~ -Groups are named sets of nodes that will be deployed together. The fields of a -group are: - -name - Required. The identifying name of the group. - -critical - Required. Indicates if this group is required to continue to additional - phases of deployment. - -depends_on - Required, may be empty list. Group names that must be successful before this - group can be processed. - -selectors - Required, may be empty list. A list of identifying information to indicate - the nodes that are members of this group. - -success_criteria - Optional. Criteria that must evaluate to be true before a group is considered - successfully complete with a phase of deployment. - -Criticality -''''''''''' -- Field: critical -- Valid values: true | false - -Each group is required to indicate true or false for the `critical` field. -This drives the behavior after the deployment of baremetal nodes. If any -groups that are marked as `critical: true` fail to meet that group's success -criteria, the workflow should halt after the deployment of baremetal nodes. A -group that cannot be processed due to a parent dependency failing will be -considered failed, regardless of the success criteria. - -Dependencies -'''''''''''' -- Field: depends_on -- Valid values: [] or a list of group names - -Each group specifies a list of depends_on groups, or an empty list. All -identified groups must complete successfully for the phase of deployment before -the current group is allowed to be processed by the current phase. - -- A failure (based on success criteria) of a group prevents any groups - dependent upon the failed group from being attempted. -- Circular dependencies will be rejected as invalid during document validation. -- There is no guarantee of ordering among groups that have their dependencies - met. Any group that is ready for deployment based on declared dependencies - will execute. Execution of groups is serialized - two groups will not deploy - at the same time. - -Selectors -''''''''' -- Field: selectors -- Valid values: [] or a list of selectors - -The list of selectors indicate the nodes that will be included in a group. -Each selector has four available filtering values: node_names, node_tags, -node_labels, and rack_names. 
Each selector is an intersection of this -critera, while the list of selectors is a union of the individual selectors. - -- Omitting a criterion from a selector, or using empty list means that criterion - is ignored. -- Having a completely empty list of selectors, or a selector that has no - criteria specified indicates ALL nodes. -- A collection of selectors that results in no nodes being identified will be - processed as if 100% of nodes successfully deployed (avoiding division by - zero), but would fail the minimum or maximum nodes criteria (still counts as - 0 nodes) -- There is no validation against the same node being in multiple groups, - however the workflow will not resubmit nodes that have already completed or - failed in this deployment to Drydock twice, since it keeps track of each node - uniquely. The success or failure of those nodes excluded from submission to - Drydock will still be used for the success criteria calculation. - -E.g.:: - - selectors: - - node_names: - - node01 - - node02 - rack_names: - - rack01 - node_tags: - - control - - node_names: - - node04 - node_labels: - - ucp_control_plane: enabled - -Will indicate (not really SQL, just for illustration):: - - SELECT nodes - WHERE node_name in ('node01', 'node02') - AND rack_name in ('rack01') - AND node_tags in ('control') - UNION - SELECT nodes - WHERE node_name in ('node04') - AND node_label in ('ucp_control_plane: enabled') - -Success Criteria -'''''''''''''''' -- Field: success_criteria -- Valid values: for possible values, see below - -Each group optionally contains success criteria which is used to indicate if -the deployment of that group is successful. The values that may be specified: - -percent_successful_nodes - The calculated success rate of nodes completing the deployment phase. - - E.g.: 75 would mean that 3 of 4 nodes must complete the phase successfully. - - This is useful for groups that have larger numbers of nodes, and do not - have critical minimums or are not sensitive to an arbitrary number of nodes - not working. - -minimum_successful_nodes - An integer indicating how many nodes must complete the phase to be considered - successful. - -maximum_failed_nodes - An integer indicating a number of nodes that are allowed to have failed the - deployment phase and still consider that group successful. - -When no criteria are specified, it means that no checks are done - processing -continues as if nothing is wrong. - -When more than one criterion is specified, each is evaluated separately - if -any fail, the group is considered failed. - - -Example Deployment Strategy Document -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -This example shows a deployment strategy with 5 groups: control-nodes, -compute-nodes-1, compute-nodes-2, monitoring-nodes, and ntp-node. 
- -:: - - --- - schema: shipyard/DeploymentStrategy/v1 - metadata: - schema: metadata/Document/v1 - name: deployment-strategy - layeringDefinition: - abstract: false - layer: global - storagePolicy: cleartext - data: - groups: - - name: control-nodes - critical: true - depends_on: - - ntp-node - selectors: - - node_names: [] - node_labels: [] - node_tags: - - control - rack_names: - - rack03 - success_criteria: - percent_successful_nodes: 90 - minimum_successful_nodes: 3 - maximum_failed_nodes: 1 - - name: compute-nodes-1 - critical: false - depends_on: - - control-nodes - selectors: - - node_names: [] - node_labels: [] - rack_names: - - rack01 - node_tags: - - compute - success_criteria: - percent_successful_nodes: 50 - - name: compute-nodes-2 - critical: false - depends_on: - - control-nodes - selectors: - - node_names: [] - node_labels: [] - rack_names: - - rack02 - node_tags: - - compute - success_criteria: - percent_successful_nodes: 50 - - name: monitoring-nodes - critical: false - depends_on: [] - selectors: - - node_names: [] - node_labels: [] - node_tags: - - monitoring - rack_names: - - rack03 - - rack02 - - rack01 - - name: ntp-node - critical: true - depends_on: [] - selectors: - - node_names: - - ntp01 - node_labels: [] - node_tags: [] - rack_names: [] - success_criteria: - minimum_successful_nodes: 1 - -The ordering of groups, as defined by the dependencies (``depends-on`` -fields):: - - __________ __________________ - | ntp-node | | monitoring-nodes | - ---------- ------------------ - | - ____V__________ - | control-nodes | - --------------- - |_________________________ - | | - ______V__________ ______V__________ - | compute-nodes-1 | | compute-nodes-2 | - ----------------- ----------------- - -Given this, the order of execution could be: - -- ntp-node > monitoring-nodes > control-nodes > compute-nodes-1 > compute-nodes-2 -- ntp-node > control-nodes > compute-nodes-2 > compute-nodes-1 > monitoring-nodes -- monitoring-nodes > ntp-node > control-nodes > compute-nodes-1 > compute-nodes-2 -- and many more ... the only guarantee is that ntp-node will run some time - before control-nodes, which will run sometime before both of the - compute-nodes. Monitoring-nodes can run at any time. - -Also of note are the various combinations of selectors and the varied use of -success criteria. - -Deployment Configuration Document (Shipyard) --------------------------------------------- -The existing deployment-configuration document that is used by the workflows -will also be modified to use the existing deployment_strategy field to provide -the name of the deployment-straegy document that will be used. - -The default value for the name of the DeploymentStrategy document will be -``deployment-strategy``. - -Drydock Changes ---------------- - -API and CLI -~~~~~~~~~~~ -- A new API needs to be provided that accepts a node filter (i.e. selector, - above) and returns a list of node names that result from analysis of the - design. Input to this API will also need to include a design reference. - -- Drydock needs to provide a "tree" output of tasks rooted at the requested - parent task. This will provide the needed success/failure status for nodes - that have been prepared/deployed. - -Documentation -~~~~~~~~~~~~~ -Drydock documentation will be updated to match the introduction of new APIs - - -Shipyard Changes ----------------- - -API and CLI -~~~~~~~~~~~ -- The commit configdocs api will need to be enhanced to look up the - DeploymentStrategy by using the DeploymentConfiguration. 
-- The DeploymentStrategy document will need to be validated to ensure there are - no circular dependencies in the groups' declared dependencies (perhaps - NetworkX_). -- A new API endpoint (and matching CLI) is desired to retrieve the status of - nodes as known to Drydock/MAAS and their MAAS status. The existing node list - API in Drydock provides a JSON output that can be utilized for this purpose. - -Workflow -~~~~~~~~ -The deploy_site and update_site workflows will be modified to utilize the -DeploymentStrategy. - -- The deployment configuration step will be enhanced to also read the - deployment strategy and pass the information on a new xcom for use by the - baremetal nodes step (see below) -- The prepare nodes and deploy nodes steps will be combined to perform both as - part of the resolution of an overall ``baremetal nodes`` step. - The baremetal nodes step will introduce functionality that reads in the - deployment strategy (from the prior xcom), and can orchestrate the calls to - Drydock to enact the grouping, ordering and and success evaluation. - Note that Drydock will serialize tasks; there is no parallelization of - prepare/deploy at this time. - -Needed Functionality -'''''''''''''''''''' - -- function to formulate the ordered groups based on dependencies (perhaps - NetworkX_) -- function to evaluate success/failure against the success criteria for a group - based on the result list of succeeded or failed nodes. -- function to mark groups as success or failure (including failed due to - dependency failure), as well as keep track of the (if any) successful and - failed nodes. -- function to get a group that is ready to execute, or 'Done' when all groups - are either complete or failed. -- function to formulate the node filter for Drydock based on a group's - selectors -- function to orchestrate processing groups, moving to the next group (or being - done) when a prior group completes or fails. -- function to summarize the success/failed nodes for a group (primarily for - reporting to the logs at this time). - -Process -''''''' -The baremetal nodes step (preparation and deployment of nodes) will proceed as -follows: - -1. Each group's selector will be sent to Drydock to determine the list of - nodes that are a part of that group. - - - An overall status will be kept for each unique node (not started | - prepared | success | failure). - - When sending a task to Drydock for processing, the nodes associated with - that group will be sent as a simple `node_name` node filter. This will - allow for this list to exclude nodes that have a status that is not - congruent for the task being performed. - - - prepare nodes valid status: not started - - deploy nodes valid status: prepared - -2. In a processing loop, groups that are ready to be processed based on their - dependencies (and the success criteria of groups they are dependent upon) - will be selected for processing until there are no more groups that can be - processed. The processing will consist of preparing and then deploying the - group. - - - The selected group will be prepared and then deployed before selecting - another group for processing. - - Any nodes that failed as part of that group will be excluded from - subsequent deployment or preparation of that node for this deployment. - - - Excluding nodes that are already processed addresses groups that have - overlapping lists of nodes due to the group's selectors, and prevents - sending them to Drydock for re-processing. 
- - Evaluation of the success criteria will use the full set of nodes
- identified by the selector. This means that if a node was previously
- successfully deployed, that same node will count as "successful" when
- evaluating the success criteria.
-
- - The success criteria will be evaluated after the group's prepare step and
- the deploy step. A failure to meet the success criteria in a prepare step
- will cause the deploy step for that group to be skipped (and marked as
- failed).
- - Any nodes that fail during the prepare step will not be used in the
- corresponding deploy step.
- - Upon completion (success, partial success, or failure) of a prepare step,
- the nodes that were sent for preparation will be marked in the unique list
- of nodes (above) with their appropriate status: prepared or failure
- - Upon completion of a group's deployment step, the nodes' status will be
- updated to their current status: success or failure.
-
-3. Before the end of the baremetal nodes step, following all eligible group
- processing, a report will be logged to indicate the success/failure of
- groups and the status of the individual nodes. Note that it is possible for
- individual nodes to be left in `not started` state if they were only part of
- groups that were never allowed to process due to dependencies and success
- criteria.
-
-4. At the end of the baremetal nodes step, if any group marked as critical
- has failed due to timeout, dependency failure, or failure to meet its
- success criteria, an Airflow Exception will be raised, resulting in a failed
- deployment.
-
-Notes:
-
-- The timeout values specified for the prepare nodes and deploy nodes steps
- will be used to put bounds on the individual calls to Drydock. A failure
- based on these values will be treated as a failure for the group; we need to
- be vigilant about whether this will lead to indeterminate states for nodes
- that interfere with further processing (e.g. timed out, but the requested
- work still continued to completion).
-
-Example Processing
-''''''''''''''''''
-Using the defined deployment strategy in the above example, the following is
-an example of how it may process::
-
- Start
- |
- | prepare ntp-node
- | deploy ntp-node
- V
- | prepare control-nodes
- | deploy control-nodes
- V
- | prepare monitoring-nodes
- | deploy monitoring-nodes
- V
- | prepare compute-nodes-2
- | deploy compute-nodes-2
- V
- | prepare compute-nodes-1
- | deploy compute-nodes-1
- |
- Finish (success)
-
-If there were a failure in preparing the ntp-node, the following would be the
-result::
-
- Start
- |
- | prepare ntp-node (fails)
- | deploy ntp-node (skipped)
- V
- | prepare monitoring-nodes
- | deploy monitoring-nodes
- |
- | control-nodes, compute-nodes-1, and compute-nodes-2 are not attempted
- | because their dependency on the failed ntp-node group is not met
- |
- Finish (failed due to critical group failed)
-
-If a failure occurred during the deploy of compute-nodes-2, the following would
-result::
-
- Start
- |
- | prepare ntp-node
- | deploy ntp-node
- V
- | prepare control-nodes
- | deploy control-nodes
- V
- | prepare monitoring-nodes
- | deploy monitoring-nodes
- V
- | prepare compute-nodes-2
- | deploy compute-nodes-2 (fails)
- V
- | prepare compute-nodes-1
- | deploy compute-nodes-1
- |
- Finish (success with some nodes/groups failed)
-
-Schemas
-~~~~~~~
-A new schema will need to be provided by Shipyard to validate the
-DeploymentStrategy document.
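
In addition to schema validation, the groups in the DeploymentStrategy document
must be checked for circular dependencies and turned into a processing order, as
described in the Shipyard API and CLI and Needed Functionality sections above.
The sketch below is a minimal illustration of how that could be done with
NetworkX; the function names, the dict-shaped group structure, and the criteria
handling are assumptions for illustration only, not the actual Shipyard
implementation.

.. code:: python

    import networkx as nx


    def validate_and_order_groups(groups):
        """Reject circular dependencies and return one valid processing order.

        ``groups`` is assumed to be the ``data.groups`` list of a
        DeploymentStrategy document, already parsed into dicts.
        """
        graph = nx.DiGraph()
        for group in groups:
            graph.add_node(group['name'])
            for parent in group.get('depends_on', []):
                # Edge parent -> child: the parent group must complete first.
                graph.add_edge(parent, group['name'])
        if not nx.is_directed_acyclic_graph(graph):
            raise ValueError('Circular dependency among deployment groups')
        return list(nx.topological_sort(graph))


    def meets_success_criteria(criteria, succeeded, failed):
        """Evaluate a group's success_criteria against node counts.

        Each specified criterion is checked independently and all must pass;
        an empty criteria mapping means no checks are performed.
        """
        total = succeeded + failed
        checks = []
        if 'percent_successful_nodes' in criteria:
            # Zero identified nodes counts as 100% successful, per the
            # selector rules above.
            actual = (100.0 * succeeded / total) if total else 100.0
            checks.append(actual >= criteria['percent_successful_nodes'])
        if 'minimum_successful_nodes' in criteria:
            checks.append(succeeded >= criteria['minimum_successful_nodes'])
        if 'maximum_failed_nodes' in criteria:
            checks.append(failed <= criteria['maximum_failed_nodes'])
        return all(checks)

With the example strategy above, ``validate_and_order_groups`` would return an
ordering such as ``['ntp-node', 'monitoring-nodes', 'control-nodes',
'compute-nodes-1', 'compute-nodes-2']``, and a group result of 9 successful and
1 failed nodes would satisfy criteria of ``percent_successful_nodes: 90`` and
``maximum_failed_nodes: 1``.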
- -Documentation -~~~~~~~~~~~~~ -The Shipyard action documentation will need to include details defining the -DeploymentStrategy document (mostly as defined here), as well as the update to -the DeploymentConfiguration document to contain the name of the -DeploymentStrategy document. - - -.. _NetworkX: https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.algorithms.dag.topological_sort.html diff --git a/docs/source/blueprints/node-teardown.rst b/docs/source/blueprints/node-teardown.rst deleted file mode 100644 index cfad769f..00000000 --- a/docs/source/blueprints/node-teardown.rst +++ /dev/null @@ -1,559 +0,0 @@ -.. - Copyright 2018 AT&T Intellectual Property. - All Rights Reserved. - - Licensed under the Apache License, Version 2.0 (the "License"); you may - not use this file except in compliance with the License. You may obtain - a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, WITHOUT - WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the - License for the specific language governing permissions and limitations - under the License. - -.. _node-teardown: - -Undercloud Node Teardown -======================== - -When redeploying a physical host (server) using the Undercloud Platform(UCP), -it is necessary to trigger a sequence of steps to prevent undesired behaviors -when the server is redeployed. This blueprint intends to document the -interaction that must occur between UCP components to teardown a server. - -Overview --------- -Shipyard is the entrypoint for UCP actions, including the need to redeploy a -server. The first part of redeploying a server is the graceful teardown of the -software running on the server; specifically Kubernetes and etcd are of -critical concern. It is the duty of Shipyard to orchestrate the teardown of the -server, followed by steps to deploy the desired new configuration. This design -covers only the first portion - node teardown - -Shipyard node teardown Process -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -#. (Existing) Shipyard receives request to redeploy_server, specifying a target - server. -#. (Existing) Shipyard performs preflight, design reference lookup, and - validation steps. -#. (New) Shipyard invokes Promenade to decommission a node. -#. (New) Shipyard invokes Drydock to destroy the node - setting a node - filter to restrict to a single server. -# (New) Shipyard invokes Promenade to remove the node from the Kubernetes - cluster. - -Assumption: -node_id is the hostname of the server, and is also the identifier that both -Drydock and Promenade use to identify the appropriate parts - hosts and k8s -nodes. This convention is set by the join script produced by promenade. - -Drydock Destroy Node --------------------- -The API/interface for destroy node already exists. The implementation within -Drydock needs to be developed. This interface will need to accept both the -specified node_id and the design_id to retrieve from Deckhand. - -Using the provided node_id (hardware node), and the design_id, Drydock will -reset the hardware to a re-provisionable state. - -By default, all local storage should be wiped (per datacenter policy for -wiping before re-use). - -An option to allow for only the OS disk to be wiped should be supported, such -that other local storage is left intact, and could be remounted without data -loss. 
e.g.: --preserve-local-storage - -The target node should be shut down. - -The target node should be removed from the provisioner (e.g. MaaS) - -Responses -~~~~~~~~~ -The responses from this functionality should follow the pattern set by prepare -nodes, and other Drydock functionality. The Drydock status responses used for -all async invocations will be utilized for this functionality. - -Promenade Decommission Node ---------------------------- -Performs steps that will result in the specified node being cleanly -disassociated from Kubernetes, and ready for the server to be destroyed. -Users of the decommission node API should be aware of the long timeout values -that may occur while awaiting promenade to complete the appropriate steps. -At this time, Promenade is a stateless service and doesn't use any database -storage. As such, requests to Promenade are synchronous. - -.. code:: json - - POST /nodes/{node_id}/decommission - - { - rel : "design", - href: "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents", - type: "application/x-yaml" - } - -Such that the design reference body is the design indicated when the -redeploy_server action is invoked through Shipyard. - -Query Parameters: - -- drain-node-timeout: A whole number timeout in seconds to be used for the - drain node step (default: none). In the case of no value being provided, - the drain node step will use its default. -- drain-node-grace-period: A whole number in seconds indicating the - grace-period that will be provided to the drain node step. (default: none). - If no value is specified, the drain node step will use its default. -- clear-labels-timeout: A whole number timeout in seconds to be used for the - clear labels step. (default: none). If no value is specified, clear labels - will use its own default. -- remove-etcd-timeout: A whole number timeout in seconds to be used for the - remove etcd from nodes step. (default: none). If no value is specified, - remove-etcd will use its own default. -- etcd-ready-timeout: A whole number in seconds indicating how long the - decommission node request should allow for etcd clusters to become stable - (default: 600). - -Process -~~~~~~~ -Acting upon the node specified by the invocation and the design reference -details: - -#. Drain the Kubernetes node. -#. Clear the Kubernetes labels on the node. -#. Remove etcd nodes from their clusters (if impacted). - - if the node being decommissioned contains etcd nodes, Promenade will - attempt to gracefully have those nodes leave the etcd cluster. -#. Ensure that etcd cluster(s) are in a stable state. - - Polls for status every 30 seconds up to the etcd-ready-timeout, or the - cluster meets the defined minimum functionality for the site. - - A new document: promenade/EtcdClusters/v1 that will specify details about - the etcd clusters deployed in the site, including: identifiers, - credentials, and thresholds for minimum functionality. - - This process should ignore the node being torn down from any calculation - of health -#. Shutdown the kubelet. - - If this is not possible because the node is in a state of disarray such - that it cannot schedule the daemonset to run, this step may fail, but - should not hold up the process, as the Drydock dismantling of the node - will shut the kubelet down. - -Responses -~~~~~~~~~ -All responses will be form of the UCP Status response. - -- Success: Code: 200, reason: Success - - Indicates that all steps are successful. 
-
-- Failure: Code: 404, reason: NotFound
-
- Indicates that the target node is not discoverable by Promenade.
-
-- Failure: Code: 500, reason: DisassociateStepFailure
-
- The details section should detail the successes and failures further. Any
- 4xx series errors from the individual steps would manifest as a 500 here.
-
-Promenade Drain Node
---------------------
-Drains the Kubernetes node corresponding to the target node. This will ensure
-that this node is no longer the target of any pod scheduling, and evicts or
-deletes the running pods. In the case of nodes running DaemonSet-managed pods,
-or pods that would prevent a drain from occurring, Promenade may be required
-to provide the `ignore-daemonsets` option or `force` option to attempt to
-drain the node as fully as possible.
-
-By default, the drain node will utilize a grace period for pods of 1800
-seconds and a total timeout of 3600 seconds (1 hour). Clients of this
-functionality should be prepared for a long timeout.
-
-.. code:: json
-
- POST /nodes/{node_id}/drain
-
-Query Parameters:
-
-- timeout: a whole number in seconds (default = 3600). This value is the total
- timeout for the kubectl drain command.
-- grace-period: a whole number in seconds (default = 1800). This value is the
- grace period used by kubectl drain. Grace period must be less than timeout.
-
-.. note::
-
- This POST has no message body
-
-Example command being used for drain (reference only)
-`kubectl drain --force --timeout 3600s --grace-period 1800 --ignore-daemonsets --delete-local-data n1`
-https://git.openstack.org/cgit/openstack/airship-promenade/tree/promenade/templates/roles/common/usr/local/bin/promenade-teardown
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
-- Success: Code: 200, reason: Success
-
- Indicates that the drain node has successfully concluded, and that no pods
- are currently running.
-
-- Failure: Status response, code: 400, reason: BadRequest
-
- A request was made with parameters that cannot work - e.g. grace-period is
- set to a value larger than the timeout value.
-
-- Failure: Status response, code: 404, reason: NotFound
-
- The specified node is not discoverable by Promenade
-
-- Failure: Status response, code: 500, reason: DrainNodeError
-
- There was a processing exception raised while trying to drain a node. The
- details section should indicate the underlying cause if it can be
- determined.
-
-Promenade Clear Labels
-----------------------
-Removes the labels that have been added to the target Kubernetes node.
-
-.. code:: json
-
- POST /nodes/{node_id}/clear-labels
-
-Query Parameters:
-
-- timeout: A whole number in seconds allowed for the pods to settle/move
- following removal of labels. (Default = 1800)
-
-.. note::
-
- This POST has no message body
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
-- Success: Code: 200, reason: Success
-
- All labels have been removed from the specified Kubernetes node.
-
-- Failure: Code: 404, reason: NotFound
-
- The specified node is not discoverable by Promenade
-
-- Failure: Code: 500, reason: ClearLabelsError
-
- There was a failure to clear labels that prevented completion. The details
- section should provide more information about the cause of this failure.
-
-Promenade Remove etcd Node
---------------------------
-Checks if the node specified contains any etcd nodes. If so, this API will
-trigger that etcd node to leave the associated etcd cluster.
-
-POST /nodes/{node_id}/remove-etcd
-
- {
- rel : "design",
- href: "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents",
- type: "application/x-yaml"
- }
-
-Query Parameters:
-
-- timeout: A whole number in seconds allowed for the removal of etcd nodes
- from the target node. (Default = 1800)
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
-- Success: Code: 200, reason: Success
-
- All etcd nodes have been removed from the specified node.
-
-- Failure: Code: 404, reason: NotFound
-
- The specified node is not discoverable by Promenade
-
-- Failure: Code: 500, reason: RemoveEtcdError
-
- There was a failure to remove etcd from the target node that prevented
- completion within the specified timeout, or that etcd prevented removal of
- the node because it would result in the cluster being broken. The details
- section should provide more information about the cause of this failure.
-
-
-Promenade Check etcd
---------------------
-Retrieves the current interpreted state of etcd.
-
-GET /etcd-cluster-health-statuses?design_ref={the design ref}
-
-Where the design_ref parameter is required for appropriate operation, and is in
-the same format as used for the join-scripts API.
-
-Query Parameters:
-
-- design_ref: (Required) the design reference to be used to discover etcd
- instances.
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
-- Success: Code: 200, reason: Success
-
- The status of each etcd in the site will be returned in the details section.
- Valid values for status are: Healthy, Unhealthy
-
-https://github.com/att-comdev/ucp-integration/blob/master/docs/source/api-conventions.rst#status-responses
-
-.. code:: json
-
- { "...": "... standard status response ...",
- "details": {
- "errorCount": {{n}},
- "messageList": [
- { "message": "Healthy",
- "error": false,
- "kind": "HealthMessage",
- "name": "{{the name of the etcd service}}"
- },
- { "message": "Unhealthy",
- "error": false,
- "kind": "HealthMessage",
- "name": "{{the name of the etcd service}}"
- },
- { "message": "Unable to access Etcd",
- "error": true,
- "kind": "HealthMessage",
- "name": "{{the name of the etcd service}}"
- }
- ]
- }
- ...
- }
-
-- Failure: Code: 400, reason: MissingDesignRef
-
- Returned if the design_ref parameter is not specified
-
-- Failure: Code: 404, reason: NotFound
-
- Returned if the specified etcd could not be located
-
-- Failure: Code: 500, reason: EtcdNotAccessible
-
- Returned if the specified etcd responded with an invalid health response
- (Not just simply unhealthy - that's a 200).
-
-
-Promenade Shutdown Kubelet
---------------------------
-Shuts down the kubelet on the specified node. This is accomplished by Promenade
-setting the label `promenade-decomission: enabled` on the node, which will
-trigger a newly-developed daemonset to run something like:
-`systemctl disable kubelet && systemctl stop kubelet`.
-This daemonset will effectively sit dormant until nodes have the appropriate
-label added, and then perform the kubelet teardown.
-
-.. code:: json
-
- POST /nodes/{node_id}/shutdown-kubelet
-
-.. note::
-
- This POST has no message body
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
- -- Success: Code: 200, reason: Success - - The kubelet has been successfully shutdown - -- Failure: Code: 404, reason: NotFound - - The specified node is not discoverable by Promenade - -- Failure: Code: 500, reason: ShutdownKubeletError - - The specified node's kubelet fails to shutdown. The details section of the - status response should contain reasonable information about the source of - this failure - -Promenade Delete Node from Cluster ----------------------------------- -Updates the Kubernetes cluster, removing the specified node. Promenade should -check that the node is drained/cordoned and has no labels other than -`promenade-decomission: enabled`. In either of these cases, the API should -respond with a 409 Conflict response. - -.. code:: json - - POST /nodes/{node_id}/remove-from-cluster - -.. note:: - - This POST has no message body - -Responses -~~~~~~~~~ -All responses will be form of the UCP Status response. - -- Success: Code: 200, reason: Success - - The specified node has been removed from the Kubernetes cluster. - -- Failure: Code: 404, reason: NotFound - - The specified node is not discoverable by Promenade - -- Failure: Code: 409, reason: Conflict - - The specified node cannot be deleted due to checks that the node is - drained/cordoned and has no labels (other than possibly - `promenade-decomission: enabled`). - -- Failure: Code: 500, reason: DeleteNodeError - - The specified node cannot be removed from the cluster due to an error from - Kubernetes. The details section of the status response should contain more - information about the failure. - - -Shipyard Tag Releases ---------------------- -Shipyard will need to mark Deckhand revisions with tags when there are -successful deploy_site or update_site actions to be able to determine the last -known good design. This is related to issue 16 for Shipyard, which utilizes the -same need. - -.. note:: - - Repeated from https://github.com/att-comdev/shipyard/issues/16 - - When multiple configdocs commits have been done since the last deployment, - there is no ready means to determine what's being done to the site. Shipyard - should reject deploy site or update site requests that have had multiple - commits since the last site true-up action. An option to override this guard - should be allowed for the actions in the form of a parameter to the action. - - The configdocs API should provide a way to see what's been changed since the - last site true-up, not just the last commit of configdocs. This might be - accommodated by new deckhand tags like the 'commit' tag, but for - 'site true-up' or similar applied by the deploy and update site commands. - -The design for issue 16 includes the bare-minimum marking of Deckhand -revisions. This design is as follows: - -Scenario -~~~~~~~~ -Multiple commits occur between site actions (deploy_site, update_site) - those -actions that attempt to bring a site into compliance with a site design. -When this occurs, the current system of being able to only see what has changed -between committed and the the buffer versions (configdocs diff) is insufficient -to be able to investigate what has changed since the last successful (or -unsuccessful) site action. -To accommodate this, Shipyard needs several enhancements. - -Enhancements -~~~~~~~~~~~~ - -#. Deckhand revision tags for site actions - - Using the tagging facility provided by Deckhand, Shipyard will tag the end - of site actions. 
- Upon completing a site action successfully, tag the revision being used
- with the tag site-action-success, and a body of dag_id:
-
- Upon completing a site action unsuccessfully, tag the revision being used
- with the tag site-action-failure, and a body of dag_id:
-
- The completion tags should only be applied upon failure if the site action
- gets past document validation successfully (i.e. gets to the point where it
- can start making changes via the other UCP components)
-
- This could result in a single revision having both site-action-success and
- site-action-failure if a later re-invocation of a site action is successful.
-
-#. Check for intermediate committed revisions
-
- Upon running a site action, before tagging the revision with the site action
- tag(s), the DAG needs to check to see if there are committed revisions that
- do not have an associated site-action tag. If there are any committed
- revisions since the last site action other than the current revision being
- used (between them), then the action should not be allowed to proceed (stop
- before triggering validations). For the calculation of intermediate
- committed revisions, assume revision 0 if there are no revisions with a
- site-action tag (null case)
-
- If the action is invoked with a parameter of
- allow-intermediate-commits=true, then this check should log that the
- intermediate committed revisions check is being skipped and not take any
- other action.
-
-#. Support action parameter of allow-intermediate-commits=true|false
-
- In the CLI for create action, the --param option supports adding parameters
- to actions. The parameters passed should be relayed by the CLI to the API
- and ultimately to the invocation of the DAG. The DAG as noted above will
- check for the presence of allow-intermediate-commits=true. This needs to be
- tested to work.
-
-#. Shipyard needs to support retrieving configdocs and rendered documents for
- the last successful site action, and last site action (successful or not
- successful)
-
- --successful-site-action
- --last-site-action
- These options would be mutually exclusive of --buffer or --committed
-
-#. Shipyard diff (shipyard get configdocs)
-
- Needs to support an option to do the diff of the buffer vs. the last
- successful site action and the last site action (successful or not
- successful).
-
- Currently there are no options to select which versions to diff (always
- buffer vs. committed)
-
- support:
- --base-version=committed | successful-site-action | last-site-action (Default = committed)
- --diff-version=buffer | committed | successful-site-action | last-site-action (Default = buffer)
-
- Equivalent query parameters need to be implemented in the API.
-
-Because the implementation of this design will result in the tagging of
-successful site-actions, Shipyard will be able to determine the correct
-revision to use while attempting to tear down a node.
-
-If the request to tear down a node indicates a revision that doesn't exist, the
-command to do so (e.g. redeploy_server) should not continue, but rather fail
-due to a missing precondition.
-
-The invocation of the Promenade and Drydock steps in this design will utilize
-the appropriate tag based on the request (default is successful-site-action) to
-determine the revision of the Deckhand documents used as the design-ref.
-
-Shipyard redeploy_server Action
--------------------------------
-The redeploy_server action currently accepts a target node. Additional
-supported parameters are needed:
-
-#.
preserve-local-storage=true which will instruct Drydock to only wipe the - OS drive, and any other local storage will not be wiped. This would allow - for the drives to be remounted to the server upon re-provisioning. The - default behavior is that local storage is not preserved. - -#. target-revision=committed | successful-site-action | last-site-action - This will indicate which revision of the design will be used as the - reference for what should be re-provisioned after the teardown. - The default is successful-site-action, which is the closest representation - to the last-known-good state. - -These should be accepted as parameters to the action API/CLI and modify the -behavior of the redeploy_server DAG. \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index e21531ca..4ae4c8e8 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -52,7 +52,6 @@ Conventions and Standards :maxdepth: 3 conventions - blueprints/blueprints dev-getting-started