Optimization and Fault Tolerance in Distributed Systems
Living in a data-driven, interconnected world undeniably increases the relevance of distributed IT systems. The modern technological discourse is dominated by terms such as Internet of Things (IoT), smart cities, sensor networks, 5G, autonomous transportation, Industry 4.0 and many more, all of which depend on IT. While these developments bring the potential for great technologies like autonomous driving, virtual reality or remote surgery, to name just a few, their increased complexity and distribution break traditional system operation concepts. In general, systems with a large number of highly interconnected components are hard for human experts to operate alone.
Recent advances in data analysis, captured by the terms Artificial Intelligence (AI) and machine learning (ML), have proven very useful for aiding expert practitioners in various fields. They make it possible to gain new insights into diverse problem domains and open the way to autonomous solutions. In the context of distributed IT systems, their application is known as AIOps (AI for system operations). The current goal of such systems is to support human experts in the task of operating large, distributed IT systems. Depending on the concrete application field, they focus on diverse objectives like high availability, fault tolerance, optimization or energy efficiency. Many researchers and global companies have recognized the AIOps approach as highly relevant and are investing considerable effort and resources in advancing the area.
Since 2015, the CIT department has worked with industrial partners (Huawei Technologies Co. Ltd, Deutsche Telekom, Siemens AG, etc.) to establish joint research projects on AIOps solutions for anomaly detection and classification, predictive fault tolerance and auto-remediation. Several years of follow-up work have resulted in many innovative methods, documented in papers published in top-ranked scientific journals and conferences.
Currently, we have several projects focusing on developing novel methods for a variety of system monitoring data, reducing the cold-start problem of deploying models in production environments, and shedding light on the decision process of “black-box” AI models. Our research endeavours extend further to the impactful areas of edge clouds, fog computing and IoT. More specifically, the additional complications that arise from the geographic distribution of system components open up a whole new spectrum of problems and research opportunities that we are actively exploring.
Topics of interest
- Internet of Things, Big Data Analytics
- Scalability, Availability and Reliability
- Service Level Agreements, Quality of Service
- Profiling, Performance Modeling, Testing, Simulations, Testbeds
- Virtualization, Network-as-a-Service, Infrastructure-as-a-Service
- AIOps for Edge and Fog Computing Infrastructures
- Detection and Description of Anomalies
- Explainable AI in the Context of Root Cause Analysis for Distributed IT Systems
- Data-driven Remediation and Recovery Strategies
Our current team
- Thorsten, Florian, Jasmin, Alexander, Li, Sasho, Soeren
We mostly adhere to an empirical research methodology, and we primarily rely on and contribute to open-source tools. Evaluations are done against relevant open-source systems like OpenStack, Kubernetes or Open vSwitch. We have access to several state-of-the-art infrastructures, including two commodity clusters consisting of 20 and 200 nodes, the faculty's HPC cluster, two OpenStack private clouds, an edge cloud based on OpenStack, and a set of Raspberry Pi IoT devices and sensors.
Several extensive collaborations with industrial partners provide us with great opportunities to work with practical real-world systems on highly relevant solutions. As a result, we are able to gather a wide range of practical experience and tools that are directly applicable in industry. This allows us to raise research questions of significant practical importance.
Our highly motivated and supportive team spirit results in many opportunities for collaborations and joint research.
ZerOps - A Self-Healing Platform
The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal to analyze collected data directly at the data source through streaming-based machine learning (ML) algorithms. Learn more... 
Anomaly Classification and Auto Remediation
This project aims at the automatic remediation of recurring problems. The routine task of fixing them should be taken off system administrators' hands, allowing them to work on meaningful and interesting projects instead. Learn more...
AIOps on Edge Computing Environments
The challenges of Edge and Fog Computing environments create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare or other critical processes. Managing this complexity, overseeing the entire system and reacting within the short intervals that Edge Computing demands exceed the abilities of human experts. In response, a new concept of combining AI methods to operate such complex infrastructures (AIOps) is on the rise. Learn more...
Generalization of IT Log Anomaly Detection Models at the Inference Stage
This research seeks to explore an end-to-end method for detecting anomalies based on logs within a variety of IT environments. This end-to-end process aims at the simplification of the monitoring and operation of the systems. Our target is to realize a general solution for anomaly detection in unknown systems. Learn more... 
Automation of Cloud Resilience Control
The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resilience is defined as the ability of a cloud platform to recover quickly and continue operating even when a failure has occurred. This cooperative research project seeks solutions that establish a link between anomaly detection, recovery procedures, and recommender systems. Learn more...
Root Cause Localization and Automatic Recovery for Microservices
In this project, we will study the following research questions:
- Root cause localization: How can the root cause of performance issues in microservices be located? (One proposed method: MicroRCA)
- Automatic recovery: Once the root cause is identified, what action should be taken to recover from the performance degradation with no or minimal SLA violations? (Ongoing)
- Extension to fog computing: In a geographically distributed, resource-constrained fog computing environment with unreliable networks, how can the approaches developed for the cloud be applied? Learn more...
Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems
We combine various strategies, from both traditional methodologies (e.g., rule mining) and advanced learning methodologies (e.g., deep learning), to provide efficient yet accurate models for the problem of anomaly detection in distributed IT systems. Learn more...
+49 30 314-25154 (Secretariat)
Room TEL 1206/7
e-mail query 
+49 (30) 314-25397
Room TEL 1204
e-mail query