(armada) Chart Time Metrics

Change-Id: I121d8fcf050a83cbcf01a14c1543d11a0b04ea2a
Samuel Pilla 2019-06-24 08:47:33 -05:00
parent 987eacad79
commit c89117539e
1 changed file with 158 additions and 0 deletions

..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License.

  http://creativecommons.org/licenses/by/3.0/legalcode

=======================================
Time Performance Metrics for Each Chart
=======================================

Allow time performance metrics on charts, including deployment time, upgrade
time, wait time, test time, and, where applicable, the time consumed for
documents or resources.

Problem description
===================

There are currently no time metrics within Armada for chart deployments,
upgrades, tests, or other actions. As a result, the time needed to deploy an
environment is unknown, which can restrict deployment or upgrade windows for
charts. Adding time metrics for the charts allows deployments and upgrades to
be predicted more reliably and highlights charts that are not acting as
intended.

Use Cases
---------

Knowing how long a chart takes to deploy or upgrade can streamline future
deployments and upgrades. It allows for predictable chart deployment and
upgrade times and helps surface inconsistencies within those deployments and
upgrades, often pinpointing which chart(s) are causing errors.

Proposed change
===============

Add time metrics to the `ChartBuilder`, `ChartDeploy`, and `ChartDelete`
classes. The timer will be built with the Python `time` library, and the
resulting metrics will be written to the logs for use or analysis.

These metrics include the full deployment, upgrade, wait, install, and delete
times for charts run through Armada. They will be logged with a date and
timestamp along with the chart name and the action performed, such as the
following::

    Ingress DEPLOYMENT start: 2019-06-25 12:34:56 UTC
    ...
    Ingress DEPLOYMENT complete: 2019-06-25 13:57:09 UTC
    Ingress DEPLOYMENT duration: 01:22:13

As shown, the logs include the chart name, the action the chart is performing,
the status of that action, and the datetime of each stage, with the duration
logged at the end. In case of an error, `complete` will be replaced with
`error`.

In order to log these metrics, the deployment files will need changes that
create the required timestamps and then log the start, completion or error,
and duration times for the chart's action.

Example:

chart_deploy.py::

    def execute(self, chart, cg_test_all_charts, prefix, known_releases):
        namespace = chart.get('namespace')
        release = chart.get('release')
        release_name = r.release_prefixer(prefix, release)
        LOG.info('Processing Chart, release=%s', release_name)

        # Deploy/upgrade
        start_time = time.time()
        ...
        LOG.info('Chart deployment/update completed in %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))

        # Wait
        start_time = time.time()
        timer = int(round(deadline - time.time()))
        chart_wait.wait(timer)
        LOG.info('Chart wait completed in %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))

        # Test
        start_time = time.time()
        just_deployed = ('install' in result) or ('upgrade' in result)
        ...
        if run_test:
            self._test_chart(release_name, test_handler)
            LOG.info('Chart test completed in %s',
                     time.strftime('%H:%M:%S',
                                   time.gmtime(time.time() - start_time)))
        ...

Alternatives
------------

1. A simplistic alternative is to merely log timestamps for each action that
   occurs on a chart. While similar to the proposed change, it would not show
   an elapsed time, only start and end points, as the sketch below shows.
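
   A minimal sketch of this timestamp-only approach, assuming the same `LOG`
   object and log format shown above::

      import time

      LOG.info('Ingress DEPLOYMENT start: %s UTC',
               time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime()))
      # ... the deployment itself runs here ...
      LOG.info('Ingress DEPLOYMENT complete: %s UTC',
               time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime()))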

2. Another alternative is to use the `datetime` library instead of the `time`
   library. This allows for very similar functionality in getting the elapsed
   time for a chart deployment, update, wait, or test. It takes slightly more
   effort to convert the `timedelta` object, produced by subtracting two
   `datetime` objects, into a string format for the log, as the sketch below
   illustrates.
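
   For illustration, a minimal sketch of this `datetime` approach, again
   assuming the same `LOG` object (variable names here are hypothetical)::

      from datetime import datetime, timezone

      start = datetime.now(timezone.utc)
      # ... the chart deployment, upgrade, wait, or test runs here ...
      elapsed = datetime.now(timezone.utc) - start  # a timedelta object

      # str(timedelta) yields e.g. '1:22:13.456789', so the microseconds
      # must be stripped (and the hours zero-padded) to match the HH:MM:SS
      # format used in the proposed change.
      LOG.info('Chart deployment completed in %s',
               str(elapsed).split('.')[0])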

3. A third alternative is to use the Prometheus metrics exposed through
   OpenStack Helm. The Prometheus config file currently has it scrape the
   cAdvisor endpoint to retrieve metrics. These metrics could be used to show
   the starting time of chart deployments based on the containers. The
   `container_start_time_seconds` metric shows the epoch timestamp for the
   container in which the chart is running, which can be converted to a
   normal timestamp. In order to grab the scraped metrics, an HTTP request
   such as the following can be used::

      curl http://127.0.0.1:9090/metrics

   Unfortunately, these metrics do not include anything that would easily
   show when a chart finished. A possibility would be to grab the next
   chart's `container_start_time_seconds` timestamp and compare it to the
   previous one, providing a rough estimate of the time performance of a
   chart deployment, as sketched below. However, for upgrade, wait, and test
   times, it may prove too complex to get accurate data from the Prometheus
   scraped metrics, since they cover only the starting of the containers.
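
   As an illustration, a rough sketch of this estimate, assuming the endpoint
   shown above and a deliberately simplified, hypothetical parsing of the
   metric exposition format::

      import time

      import requests

      # Fetch the scraped metrics from the endpoint shown above.
      body = requests.get('http://127.0.0.1:9090/metrics').text

      # Simplified parsing: collect the epoch value of every
      # container_start_time_seconds sample, oldest first.
      starts = sorted(
          float(line.rsplit(' ', 1)[1])
          for line in body.splitlines()
          if line.startswith('container_start_time_seconds'))

      # Convert an epoch timestamp to a normal timestamp.
      LOG.info('Container started at %s UTC',
               time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(starts[0])))

      # Rough duration estimate: the next container's start time minus
      # the previous container's start time.
      if len(starts) > 1:
          LOG.info('Estimated chart deployment took roughly %d seconds',
                   int(starts[1] - starts[0]))
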
Security Impact
---------------
None
Notifications Impact
--------------------
Extra log notifications displaying deployment or upgrade times
Other End User Impact
---------------------
None
Performance Impact
------------------
None
Other Deployer Impact
---------------------
None
Implementation
==============
Assignee(s)
-----------
Work Items
----------
Dependencies
============
None
Documentation Impact
====================
None