ZerOps - A Self-Healing Platform
Telecommunication service and network operators are confronted with rising expectations regarding availability, performance, and guaranteed quality of service (QoS). The complexity of modern IT infrastructures has increased to a point where traditional IT administration procedures fail to holistically ensure the dependability of these systems.
At the same time, approaches based on artificial intelligence (AI) are revolutionizing domains such as medicine, manufacturing, and autonomous driving. This strongly motivates the use of AI for the autonomous management of highly complex IT systems (AIOps).
Researchers and global companies have recognized this potential and started to work on AIOps solutions. Since 2015, the CIT department has been working with industrial partners (Deutsche Telekom and Huawei Technologies Co., Ltd.) in a joint research lab on solutions for anomaly detection and classification, predictive fault tolerance, and auto-remediation. This work produced the self-healing cloud platform ZerOps, which members of the CIT group continue to adjust and enhance.
The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal of analyzing collected data directly at the data source using streaming-based machine learning (ML) algorithms. ZerOps can be integrated into existing cloud infrastructures. The second major design goal is a modular and flexible data analysis pipeline that can be assembled from multiple interchangeable elements. This allows customization for different infrastructure use cases and also supports easy experimentation with new algorithmic approaches for research purposes.
Because of the decentralized deployment, the data analysis is co-located with regular system components, so its resource usage has to be limited to a certain percentage of the available resources. ZerOps also incorporates streaming analytics and event aggregation to determine anomaly root causes and to perform further advanced analyses of anomaly situations. Through the integration of unsupervised anomaly detection, ZerOps can detect previously unknown problems as well as known, already learned anomalies. A decentralized ML model repository enables transfer learning to overcome cold-start problems for dynamic IT infrastructure components. ZerOps also supports automatic hyperparameter selection for its ML algorithms.
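To make the idea of a modular, in-place analysis pipeline more concrete, the following is a minimal sketch in Python. It is not the ZerOps API: the names `run_pipeline`, `scale_percent`, and `RunningZScore` are hypothetical, and the running z-score detector is only a simple stand-in for the streaming, unsupervised ML algorithms mentioned above. It assumes that each pipeline element is a function that receives one metric sample and returns a (possibly enriched) sample, so that elements can be exchanged freely and samples are processed one at a time at the data source. Resource limiting, event aggregation, and the model repository are not covered.

```python
import math
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Sequence

Sample = Dict[str, Any]                      # one metric sample, e.g. {"cpu": 50}
Step = Callable[[Sample], Optional[Sample]]  # interchangeable pipeline element


def run_pipeline(source: Iterable[Sample], steps: Sequence[Step]) -> Iterator[Sample]:
    """Pass each sample through all steps, one sample at a time (streaming)."""
    for sample in source:
        for step in steps:
            sample = step(sample)
            if sample is None:  # a step may drop a sample
                break
        else:
            yield sample


def scale_percent(sample: Sample) -> Sample:
    """Hypothetical preprocessing step: convert a CPU percentage to [0, 1]."""
    return {**sample, "cpu": sample["cpu"] / 100.0}


class RunningZScore:
    """Illustrative online detector: flags values far from the running mean.

    Mean and variance are updated incrementally (Welford's algorithm), so no
    history of samples has to be stored.
    """

    def __init__(self, key: str, threshold: float = 3.0, warmup: int = 10):
        self.key = key
        self.threshold = threshold
        self.warmup = warmup  # number of samples to observe before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # sum of squared deviations from the mean

    def __call__(self, sample: Sample) -> Sample:
        x = sample[self.key]
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) > self.threshold * std
        if not is_anomaly:
            # Learn only from normal samples so outliers do not skew the baseline.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return {**sample, "anomaly": is_anomaly}


# Usage: ten normal CPU readings, one spike, then a normal reading again.
readings = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50, 95, 50]
source = ({"cpu": value} for value in readings)
results = list(run_pipeline(source, [scale_percent, RunningZScore("cpu")]))
flags = [r["anomaly"] for r in results]
print(flags)
```

In this sketch only the spike (the eleventh reading) is flagged. Because every element shares the same call signature, the detector could be swapped for a different algorithm without touching the rest of the pipeline, which is the kind of flexibility the modular design goal describes.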