Automation of Cloud Resilience Control
The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resilience is the ability of a cloud platform to recover quickly and continue operating after a failure. This cooperative research project seeks solutions that link anomaly detection, recovery procedures, and recommender systems.
Anomaly detection via distributed tracing. Cloud platforms such as OpenStack can trace user requests using tracing technologies, e.g., OSProfiler or Zipkin. Distributed tracing records detailed insights into how a cloud platform reacts to requests by sending events to a central system. Incoming events can then be continuously processed and matched against patterns associated with errors and failures.
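The matching step described above can be illustrated with a minimal Python sketch. It is not tied to OSProfiler or Zipkin: the `TraceEvent` fields, the two example patterns, and the 2000 ms threshold are all assumptions chosen for illustration, and a real deployment would derive its patterns from the actual trace schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence, Tuple


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical, simplified trace event (not a real tracer schema)."""
    service: str
    operation: str
    duration_ms: float
    status: str  # assumed to be "ok" or "error"


@dataclass(frozen=True)
class Pattern:
    """A named predicate over a single event."""
    name: str
    matches: Callable[[TraceEvent], bool]


# Illustrative patterns; the names and threshold are assumptions.
PATTERNS: Sequence[Pattern] = (
    Pattern("slow-request", lambda e: e.duration_ms > 2000),
    Pattern("failed-request", lambda e: e.status == "error"),
)


def detect_anomalies(
    events: Iterable[TraceEvent],
    patterns: Sequence[Pattern] = PATTERNS,
) -> Iterator[Tuple[str, TraceEvent]]:
    """Yield (pattern name, event) for every pattern an event matches."""
    for event in events:
        for pattern in patterns:
            if pattern.matches(event):
                yield pattern.name, event
```

Because `detect_anomalies` is a generator, it can consume an unbounded stream of events and emit matches as they arrive, which fits the continuous processing described above. Patterns that span several correlated events would need additional state that this sketch does not model.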
Recovery procedures. Infrastructure as Code tools, such as SaltStack and Ansible, are becoming indispensable for cloud management. While many organizations still engineer their infrastructure largely by hand, these tools allow predefined procedures to recover cloud platforms automatically when problems occur.
Recommender systems. While humans can select recovery procedures in response to specific event patterns, the complexity and scale of modern cloud platforms call for intelligent solutions that recommend to operators the most relevant procedures to execute when errors occur. Current technologies are mature enough to provide the fundamental building blocks that enable cloud platforms to recover from failures autonomously.
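One simple way to picture such a recommender is a ranking of recovery procedures by how often each one has resolved a given event pattern in the past. The Python sketch below makes that assumption explicit: the class `ProcedureRecommender`, its methods, and the smoothed success-rate score are hypothetical illustrations, not a method prescribed by the project.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ProcedureRecommender:
    """Hypothetical recommender: ranks procedures by past success rate."""

    def __init__(self) -> None:
        # pattern -> procedure -> [successes, attempts]
        self._stats: Dict[str, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(lambda: [0, 0])
        )

    def record_outcome(self, pattern: str, procedure: str, succeeded: bool) -> None:
        """Store whether running `procedure` resolved an occurrence of `pattern`."""
        counts = self._stats[pattern][procedure]
        counts[1] += 1
        if succeeded:
            counts[0] += 1

    def recommend(self, pattern: str, top_n: int = 3) -> List[Tuple[str, float]]:
        """Return up to `top_n` (procedure, score) pairs, best first.

        The score is a Laplace-smoothed success rate,
        (successes + 1) / (attempts + 2), so that procedures with
        very few recorded runs are not ranked with false confidence.
        """
        procedures = self._stats.get(pattern, {})
        scored = [
            (name, (successes + 1) / (attempts + 2))
            for name, (successes, attempts) in procedures.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_n]
```

In this sketch the operator stays in the loop: the recommender only proposes a ranked list, and the outcome of whichever procedure is executed is fed back through `record_outcome`. A production system would likely use richer features than the pattern name alone.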