Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems
A distributed IT services provider offers a wide range of storage, computing, and networking instances and must meet the varied demands of its customers under changing conditions. Both parties expect a stable, reliable, continuous service, which requires the best possible maintenance and troubleshooting at any time and in any location. Stability analysis of the system is therefore of crucial importance.
One way to assess the stability of a system, as a first step towards ensuring reliable service, is anomaly detection. IT systems provide a variety of monitoring data (e.g. metrics, traces, and logs) that are used to assess the state of the system. Each of these contains rich information that can be utilized for anomaly detection. Our research shows that the complementary information from all of these modalities in a distributed system environment can be combined efficiently to provide a better representation of the interactions inside the system.
To do so, we use the latest achievements in Machine Learning and Artificial Intelligence. More specifically, we combine strategies from both traditional (e.g. rule mining) and advanced learning methodologies (e.g. Deep Learning) to provide efficient yet accurate models for anomaly detection in distributed IT systems. These methods are rigorously tested and have proven usable within distributed environments.
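As a rough illustration of what combining modalities can mean in practice, the sketch below scores each time window with a robust z-score (median and median absolute deviation) per modality and flags a window when any modality deviates strongly. This is a deliberately simple statistical baseline, not the learning-based models described above. The feature names (`cpu`, `log_errors`, `trace_latency_ms`), the sample values, and the threshold of 3.5 are all assumptions made for the example.

```python
from statistics import median


def robust_z(values):
    """Return robust z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 rescales the MAD to match the standard deviation for normal data.
    # Fall back to a unit scale when the MAD is zero (assumption for this sketch).
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [(v - med) / scale for v in values]


def detect_anomalies(windows, threshold=3.5):
    """Flag time windows in which any modality deviates strongly.

    `windows` is a list of dicts, one per time window, with hypothetical
    per-modality features: a metric ('cpu'), a log-derived count
    ('log_errors'), and a trace-derived latency ('trace_latency_ms').
    """
    keys = ("cpu", "log_errors", "trace_latency_ms")
    scores = {k: robust_z([w[k] for w in windows]) for k in keys}
    # Combine modalities by taking the largest absolute score per window.
    combined = [
        max(abs(scores[k][i]) for k in keys) for i in range(len(windows))
    ]
    return [i for i, s in enumerate(combined) if s > threshold]


if __name__ == "__main__":
    # Made-up sample data: the last window is unusual in every modality.
    cpu = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30, 0.31, 0.95]
    errors = [1, 0, 2, 1, 1, 0, 1, 40]
    latency = [120, 118, 125, 122, 119, 121, 123, 480]
    sample = [
        {"cpu": c, "log_errors": e, "trace_latency_ms": t}
        for c, e, t in zip(cpu, errors, latency)
    ]
    print(detect_anomalies(sample))
```

Taking the maximum across modalities means a window is flagged even when only one data source looks unusual. A learned model could instead weigh the modalities jointly, which is closer in spirit to the combined representation described above.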
In this context, we are particularly interested in methods and approaches for post-analysis of the results of our detection methods. We actively work on locating potential sources of an anomaly within the system and on performing root cause analysis. Our research falls within the emerging field of AIOps (Artificial Intelligence for IT Operations), which is expected to play a very significant role in the future.