TU Berlin

Optimization and Fault Tolerance in Distributed Systems

Living in a data-driven, interconnected world undeniably increases the relevance of distributed IT systems. Modern technological discourse is dominated by terms such as Internet of Things (IoT), smart cities, sensor networks, 5G, autonomous transportation, and Industry 4.0, all of which depend on IT. While these developments promise powerful technologies such as autonomous driving, virtual reality, or remote surgery, to name a few, their increased complexity and distribution break traditional system operation concepts. In general, systems with a growing number of highly interconnected components are hard for human experts to operate alone.

Recent advances in data analysis, embodied in the terms Artificial Intelligence (AI) and machine learning (ML), have proven very useful for aiding expert practitioners in various fields. They enable new insights into a range of problem domains and open the opportunity for autonomous solutions. In the context of distributed IT systems, their application is known as AIOps (AI for system operations). The current goal of such systems is to support human experts in operating large, distributed IT systems. Depending on the concrete application field, they focus on diverse objectives such as high availability, fault tolerance, optimization, or energy efficiency. Many researchers and global companies have recognized the AIOps approach as highly relevant and are investing considerable effort and resources in advancing the area.

Since 2015, the CIT department has worked with industrial partners (Huawei Technologies Co. Ltd, Deutsche Telekom, Siemens AG, etc.) to establish joint research projects on AIOps solutions for anomaly detection and classification, predictive fault tolerance, and auto-remediation. Several years of follow-up work have resulted in many innovative methods, documented in papers published in top-ranked scientific journals and conferences.

Currently, we have several projects that focus on developing novel methods for a variety of system monitoring data, reducing the cold-start problem of deploying models in production environments, and shedding light on the decision process of “black-box” AI models. Our research extends further into the areas of edge clouds, fog computing, and IoT. More specifically, the additional complications that arise from the distributed placement of system components open a whole new spectrum of problems and research opportunities that we are actively exploring.

Topics of interest

  • Internet of Things, Big Data Analytics
  • Scalability, Availability and Reliability
  • Service Level Agreements, Quality of Service
  • Profiling, Performance Modeling, Testing, Simulations, Testbeds
  • Virtualization, Network-as-a-Service, Infrastructure-as-a-Service
  • AIOps for Edge and Fog Computing Infrastructures
  • Detection and Description of Anomalies
  • Explainable AI in the Context of Root Cause Analysis for Distributed IT Systems
  • Data-driven Remediation and Recovery Strategies

Research Methodology

We mostly adhere to an empirical research methodology, primarily relying on and contributing to open-source tools. Evaluations are done against relevant open-source systems such as OpenStack, Kubernetes, or Open vSwitch. We have access to several state-of-the-art infrastructures, including two commodity clusters consisting of 20 and 200 nodes, the faculty's HPC cluster, two OpenStack private clouds, an edge cloud based on OpenStack, and a set of Raspberry Pi IoT devices and sensors.

Several extensive collaborations with industrial partners give us the opportunity to work with real-world systems on highly relevant solutions. As a result, we gather a wide range of practical experience and tools that are directly applicable in industry, which allows us to raise research questions of practical significance.

Our highly motivated and supportive team creates many opportunities for collaboration and joint research.


We have several currently active and upcoming projects.

ZerOps - A Self-Healing Platform

The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal to analyze collected data directly at the data source through streaming-based machine learning (ML) algorithms.
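To make the idea of in-place, streaming-based analysis more concrete, the following is a minimal sketch and not the ZerOps implementation. It assumes a simple z-score rule on a single metric; the class name `StreamingZScoreDetector`, the threshold, the warm-up length, and the sample values are all hypothetical. The point it illustrates is that each sample can be scored at the data source with constant memory, without storing or shipping the raw history.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamingZScoreDetector:
    """Hypothetical helper: flags samples far from the running mean."""
    threshold: float = 3.0  # assumed cut-off, in standard deviations
    warmup: int = 10        # assumed number of samples seen before flagging starts
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0         # running sum of squared deviations (Welford)

    def update(self, value: float) -> bool:
        """Score one sample against past data, then fold it into the statistics."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's online update: constant memory, no stored history.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


# Made-up CPU load samples with one obvious spike.
cpu_load = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48, 0.52,
            0.51, 0.49, 0.50, 0.51, 0.95, 0.50]
detector = StreamingZScoreDetector()
flagged = [i for i, v in enumerate(cpu_load) if detector.update(v)]
print(flagged)  # indices of samples flagged as anomalous
```

Note that this sketch folds flagged samples back into the running statistics, which widens the accepted range after a spike. Whether anomalous samples should update the model is a design choice that depends on the workload.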

Anomaly Classification and Auto Remediation

This project aims at the automatic remediation of recurring problems. Routine repair tasks should be taken off system administrators' hands, allowing them to work on meaningful and interesting projects instead.

AIOps on Edge Computing Environments

The challenges of Edge and Fog Computing environments create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare, or other critical processes. Overseeing the entire system and reacting within short intervals, as the demanding requirements of Edge Computing dictate, surpasses the ability of human experts. In response, a new concept of combining AI methods to operate such complex infrastructures (AIOps) is on the rise.

IT Log anomaly detection model generalization on inference stage

This research explores an end-to-end method for detecting anomalies based on logs within a variety of IT environments. This end-to-end process aims to simplify the monitoring and operation of the systems. Our target is a general solution for anomaly detection in unknown systems.

Automation of Cloud Resilience Control

The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resilience is defined as the ability of a cloud platform to recover quickly and continue operating even when a failure has occurred. This cooperative research project seeks solutions that establish a link between anomaly detection, recovery procedures, and recommender systems.

Root Cause Localization and Automatic Recovery for Microservices

In this project, we will study the following research questions: 

  • Root cause localization: How can the root cause of performance issues in microservices be located?
    (One proposed method: MicroRCA)
  • Automatic recovery: Once the root cause is identified, what action should be taken to recover from the performance degradation with no or minimal SLA violations? (Ongoing)
  • Extension to fog computing: How can the approaches developed for the cloud be applied in a geographically distributed, resource-constrained fog computing environment with unreliable networks?
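As an illustration of the root cause localization question above, the following is a minimal sketch of one generic graph-based approach: a personalized PageRank over the service call graph, biased toward services with high anomaly scores. It is not a reproduction of MicroRCA. The function name `rank_root_causes`, the damping factor, the service names, and the anomaly scores are assumptions made for the example. The walk follows caller-to-callee edges, so rank accumulates in the dependencies that anomalous services rely on.

```python
def rank_root_causes(calls, anomaly_score, damping=0.85, iterations=50):
    """Rank services by a personalized PageRank over the call graph.

    calls: dict mapping a service to the list of services it calls.
    anomaly_score: dict mapping a service to a non-negative anomaly score.
    Returns the services ordered from most to least suspicious.
    """
    services = sorted(
        set(calls) | set(anomaly_score) | {c for cs in calls.values() for c in cs}
    )
    total = sum(anomaly_score.get(s, 0.0) for s in services)
    if total <= 0:
        raise ValueError("at least one service needs a positive anomaly score")

    # Preference vector: restarts land on services in proportion to their score.
    pref = {s: anomaly_score.get(s, 0.0) / total for s in services}
    rank = dict(pref)

    for _ in range(iterations):
        nxt = {s: (1.0 - damping) * pref[s] for s in services}
        for s in services:
            callees = calls.get(s, [])
            if callees:
                # Spread this service's rank evenly over the services it calls.
                share = damping * rank[s] / len(callees)
                for c in callees:
                    nxt[c] += share
            else:
                # A service without callees returns its rank via the preference vector.
                for t in services:
                    nxt[t] += damping * rank[s] * pref[t]
        rank = nxt

    return sorted(services, key=lambda s: rank[s], reverse=True)


# Hypothetical call graph and anomaly scores.
calls = {"frontend": ["cart", "catalog"], "cart": ["db"]}
scores = {"frontend": 0.6, "cart": 0.7, "catalog": 0.1, "db": 0.9}
ranking = rank_root_causes(calls, scores)
print(ranking)
```

In this made-up example the database ranks first, because it has the highest anomaly score and also receives rank from the anomalous services that depend on it.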

Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems

We combine various strategies, from both traditional (e.g., rule mining) and advanced learning methodologies (e.g., deep learning), to provide efficient yet accurate models for anomaly detection in distributed IT systems.

Related Publications


A2Log: Attentive Augmented Log Anomaly Detection

Wittkopp, Thorsten and Acker, Alexander and Nedelkoski, Sasho and Bogatinovski, Jasmin and Scheinert, Dominik and Fan, Wu and Kao, Odej

55th Hawaii International Conference on System Sciences, to appear. 2022

LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision

Wittkopp, Thorsten and Wiesner, Philipp and Scheinert, Dominik and Acker, Alexander

19th International Conference on Service-Oriented Computing, to appear. 2021

A Taxonomy of Anomalies in Log Data

Wittkopp, Thorsten and Wiesner, Philipp and Scheinert, Dominik and Kao, Odej

19th International Conference on Service-Oriented Computing, to appear. 2021

Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

Scheinert, Dominik and Acker, Alexander and Thamsen, Lauritz and Geldenhuys, Morgan K. and Kao, Odej

Workshop Proceedings of the 43rd International Conference on Software Engineering, 7–12. 2021

Artificial Intelligence for IT Operations (AIOPS) Workshop White Paper

Bogatinovski, Jasmin and Nedelkoski, Sasho and Acker, Alexander and Schmidt, Florian and Wittkopp, Thorsten and Becker, Soeren and Cardoso, Jorge and Kao, Odej


Performance Diagnosis in Cloud Microservices using Deep Learning

Wu, Li and Bogatinovski, Jasmin and Nedelkoski, Sasho and Tordsson, Johan and Kao, Odej

18th International Conference on Service-Oriented Computing, to appear. 2020

Towards AIOps in Edge Computing Environments

Becker, Soeren and Schmidt, Florian and Gulenko, Anton and Acker, Alexander and Kao, Odej

2020 IEEE International Conference on Big Data. IEEE, 3470–3475. 2020

TELESTO: A Graph Neural Network Model for Anomaly Classification in Cloud Services

Scheinert, Dominik and Acker, Alexander

18th International Conference on Service-Oriented Computing, 214–227. 2020

Decentralized Federated Learning Preserves Model and Data Privacy

Wittkopp, Thorsten and Acker, Alexander

18th International Conference on Service-Oriented Computing, 176–187. 2020

Self-Attentive Classification-Based Anomaly Detection in Unstructured Logs

Nedelkoski, Sasho and Bogatinovski, Jasmin and Acker, Alexander and Cardoso, Jorge and Kao, Odej

ICDM 2020: 20th IEEE International Conference on Data Mining, 1196–1201. 2020
