(armada) Chart Time Metrics

Change-Id: I121d8fcf050a83cbcf01a14c1543d11a0b04ea2a
Samuel Pilla 2019-06-24 08:47:33 -05:00

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================================
Time Performance Metrics for Each Chart
=======================================
Allow time performance metrics for charts, including deployment time, upgrade
time, wait time, test time, and, where applicable, the time consumed by
documents or resources.

Problem description
===================
Armada currently records no time metrics for chart deployments, upgrades,
tests, or other actions. Without them there is no known baseline for how long
an environment takes to deploy, which makes it difficult to plan deployment or
upgrade windows for charts. Adding time metrics for the charts allows for
better predictability of deployments and upgrades and makes it easier to spot
when charts are not behaving as intended.

Use Cases
---------
Knowing how long a chart takes to deploy or upgrade helps streamline future
deployments and upgrades. It makes chart deployment and upgrade times
predictable and exposes inconsistencies between runs, often pinpointing which
chart(s) are causing errors.

Proposed change
===============
Add time metrics to the `ChartBuilder`, `ChartDeploy`, and `ChartDelete`
classes. Timing will use the built-in Python `time` library, and the results
will be written to the logs for later use or analysis.

These metrics include the full deployment, upgrade, wait, install, and delete
times for charts managed through Armada. They will be logged with a date and
timestamp, together with the chart name and the action performed, such as the
following::

    Ingress DEPLOYMENT start: 2019-06-25 12:34:56 UTC
    ...
    Ingress DEPLOYMENT complete: 2019-06-25 13:57:09 UTC
    Ingress DEPLOYMENT duration: 01:22:13

As shown, each log line records the chart, the action being performed, the
stage of that action, and the datetime of that stage, with the total duration
reported at the end. In case of an error, `complete` is replaced with `error`.

In order to log these metrics, changes to the deployment files are needed to
create the required timestamps and then log the start, completion (or error),
and duration times for the chart's action.

Example:

chart_deploy.py::

    def execute(self, chart, cg_test_all_charts, prefix, known_releases):
        namespace = chart.get('namespace')
        release = chart.get('release')
        release_name = r.release_prefixer(prefix, release)
        LOG.info('Processing Chart, release=%s', release_name)

        start_time = time.time()
        ...
        LOG.info('Chart deployment/update completed in %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))

        start_time = time.time()
        # Wait
        timer = int(round(deadline - time.time()))
        chart_wait.wait(timer)
        LOG.info('Chart wait completed in %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))

        start_time = time.time()
        # Test
        just_deployed = ('install' in result) or ('upgrade' in result)
        ...
        if run_test:
            self._test_chart(release_name, test_handler)
        LOG.info('Chart test completed in %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))
        ...
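
Since a failed action should be logged with an `error` stage instead of
`complete`, the same method would also need to record a timestamp when an
exception occurs. A minimal sketch of that error handling, reusing the `LOG`
and `start_time` variables above (the exception handling shown here is
illustrative only, not existing Armada code)::

    try:
        ...
    except Exception:
        LOG.info('Chart deployment/update error after %s',
                 time.strftime('%H:%M:%S',
                               time.gmtime(time.time() - start_time)))
        raise
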
Alternatives
------------
1. A simplistic alternative is to merely log timestamps for each action that
   occurs on a chart. While almost the same as the proposed change, it does
   not report an elapsed time, only start and end points.
2. Another alternative is to use the `datetime` library instead of the `time`
   library. It provides very similar functionality for measuring the elapsed
   time of chart deployment, update, wait, test, and so on, but it takes
   slightly more effort to convert the `timedelta` produced by subtracting two
   `datetime` objects into a string suitable for the log (see the first sketch
   after this list).
3. A third alternative is to use the Prometheus metrics available through
   OpenStack-Helm. The Prometheus configuration currently scrapes the cAdvisor
   endpoint to retrieve metrics. These metrics could be used to show the start
   time of chart deployments based on their containers: the
   `container_start_time_seconds` metric reports the epoch timestamp for the
   container a chart is running in, which can be converted to a normal
   timestamp. The scraped metrics can be retrieved with an HTTP request such
   as::

       curl http://127.0.0.1:9090/metrics

   Unfortunately, these metrics do not include anything that would easily show
   when a chart has finished. One possibility would be to take the next
   chart's `container_start_time_seconds` timestamp and compare it with the
   previous chart's, giving a rough estimate of a chart's deployment time.
   However, for upgrade, wait, and test times it may prove too complex to
   derive accurate data from the scraped Prometheus metrics, since they only
   report when containers started (see the second sketch after this list).
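
As a rough sketch of the `datetime` alternative above, the elapsed time can be
captured with two `datetime` objects and the resulting `timedelta` formatted
by hand for the log (the snippet is illustrative only, not existing Armada
code)::

    import logging
    from datetime import datetime, timezone

    LOG = logging.getLogger(__name__)

    start = datetime.now(timezone.utc)
    # ... chart deployment / upgrade / wait / test happens here ...
    elapsed = datetime.now(timezone.utc) - start  # a timedelta object

    # timedelta has no strftime(), so it is reduced to HH:MM:SS manually.
    hours, remainder = divmod(int(elapsed.total_seconds()), 3600)
    minutes, seconds = divmod(remainder, 60)
    LOG.info('Chart deployment completed in %02d:%02d:%02d',
             hours, minutes, seconds)

And a rough sketch of the Prometheus alternative: extracting
`container_start_time_seconds` samples from the text exposition format
returned by the endpoint in the `curl` example above. The parsing helper and
its name are assumptions for illustration only::

    import re
    import urllib.request
    from datetime import datetime, timezone

    METRICS_URL = 'http://127.0.0.1:9090/metrics'
    PATTERN = re.compile(
        r'^container_start_time_seconds\{(?P<labels>.*)\}\s+'
        r'(?P<value>[\d.eE+-]+)$')

    def container_start_times(url=METRICS_URL):
        """Return (labels, start_datetime) pairs for each container sample."""
        body = urllib.request.urlopen(url).read().decode()
        samples = []
        for line in body.splitlines():
            match = PATTERN.match(line)
            if match:
                epoch = float(match.group('value'))
                samples.append((match.group('labels'),
                                datetime.fromtimestamp(epoch,
                                                       tz=timezone.utc)))
        return samples

    # A rough per-chart deployment estimate: compare one container's start
    # time with the start time of the next chart's container.
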
Security Impact
---------------
None
Notifications Impact
--------------------
Extra notifications displaying deployment and upgrade times
Other End User Impact
---------------------
None
Performance Impact
------------------
None
Other Deployer Impact
---------------------
None
Implementation
==============
Assignee(s)
-----------
Work Items
----------
Dependencies
============
None
Documentation Impact
====================
None
References
==========
TODO