TU Berlin

Department of Telecommunication Systems

Optimization and Fault Tolerance in Distributed Systems

Living in a data-driven, interconnected world undeniably increases the relevance of distributed IT systems. The modern technological discourse is dominated by terms such as Internet of Things (IoT), smart cities, sensor networks, 5G, autonomous transportation, Industry 4.0 and many more, all of which depend on IT. While they hold the potential for great technologies such as autonomous driving, virtual reality or remote surgery, to name just a few, their increased complexity and distribution break traditional system operation concepts. In general, systems with an increasing number of highly interconnected components are hard to operate by human experts alone.

Recent advances in data analysis, summarized under the terms Artificial Intelligence (AI) and machine learning (ML), have proven very useful for aiding expert practitioners in various fields. They make it possible to gain new insights into a range of problem domains and open the way to autonomous solutions. In the context of distributed IT systems, their application is known as AIOps (AI for system operations). The current goal of such systems is to support human experts in operating large and distributed IT systems. Depending on the concrete application field, they focus on diverse objectives such as high availability, fault tolerance, optimization or energy efficiency. Many researchers and global companies have recognized the AIOps approach as highly relevant and are investing considerable effort and resources in advancing the area.

Since 2015, the CIT department has worked with industrial partners (Huawei Technologies Co. Ltd, Deutsche Telekom, Siemens AG, etc.) to establish joint research projects on AIOps solutions for anomaly detection and classification, predictive fault tolerance and auto-remediation. Several years of follow-up work have resulted in many innovative methods, documented by papers published in top-ranked scientific journals and conferences.

Currently, we have several projects that focus on developing novel methods for a variety of system monitoring data, reducing the cold-start problem of deploying models in production environments, and shedding light on the decision process of “black-box” AI models. Our research extends further into the impactful areas of edge clouds, fog computing and IoT. More specifically, the additional complications that arise from the geographic distribution of system components open up a whole new spectrum of problems and research opportunities that we are actively exploring.

Topics of interest

  • Internet of Things, Big Data Analytics
  • Scalability, Availability and Reliability
  • Service Level Agreements, Quality of Service
  • Profiling, Performance Modeling, Testing, Simulations, Testbeds
  • Virtualization, Network-as-a-Service, Infrastructure-as-a-Service
  • AIOps for Edge and Fog Computing Infrastructures
  • Detection and Description of Anomalies
  • Explainable AI in the Context of Root Cause Analysis for Distributed IT Systems
  • Data-driven Remediation and Recovery Strategies

Our current team

Research Methodology

We mostly adhere to an empirical research methodology and primarily rely on and contribute to open-source tools. Evaluations are done against relevant open-source systems such as OpenStack, Kubernetes or Open vSwitch. We have access to several state-of-the-art infrastructures, including two commodity clusters consisting of 20 and 200 nodes, the faculty's HPC cluster, two OpenStack private clouds, an edge cloud based on OpenStack, and a set of Raspberry Pi IoT devices and sensors.

Several extensive collaborations with industrial partners provide us with great opportunities to work with practical real-world systems on highly relevant solutions. As a result, we are able to gather a wide range of practical experience and tools that are directly applicable in industry. This allows us to raise research questions of significant practical relevance.

Our highly motivated and supportive team creates many opportunities for collaboration and joint research.

Projects

We have several currently active and upcoming projects.

ZerOps - A Self-Healing Platform

The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal of analyzing collected data directly at the data source through streaming-based machine learning (ML) algorithms.
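
To illustrate what streaming-based analysis at the data source can look like, here is a minimal Python sketch. It is not the ZerOps implementation: the class name `StreamingZScoreDetector`, the helper `detect`, and the `threshold` and `warmup` defaults are hypothetical, and a simple running z-score (computed with Welford's online algorithm) stands in for whatever models the platform actually uses. The point is only that each sample is scored on arrival and no history needs to be stored or shipped elsewhere.

```python
import math


class StreamingZScoreDetector:
    """Flags a sample when it deviates strongly from the running mean.

    Hypothetical illustration: statistics are maintained with Welford's
    online algorithm, so no past samples are kept in memory.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed z-score cut-off
        self.warmup = warmup        # samples to observe before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, value):
        """Score the value against past data, then fold it into the statistics."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford update of count, mean and squared deviations
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


def detect(stream, **kwargs):
    """Return the indices of samples flagged as anomalous."""
    detector = StreamingZScoreDetector(**kwargs)
    return [i for i, value in enumerate(stream) if detector.update(value)]
```

In this sketch a flagged sample is still folded into the running statistics, which keeps the code short but lets a burst of anomalies raise the baseline; a production detector would likely handle that differently.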

Anomaly Classification and Auto Remediation

This project aims at the automatic remediation of recurring problems. The routine tasks of fixing them should be taken off system administrators' hands, allowing them to work on meaningful and interesting projects instead.
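
One simple way to picture this idea is a playbook that maps a classified anomaly to a routine action and hands everything else to a human. The Python sketch below is purely illustrative and not taken from the project: the anomaly class names, the functions `restart_service` and `free_disk_space`, and the `PLAYBOOK` table are all hypothetical, and the actions only return a description instead of touching a real system.

```python
def restart_service(target):
    # Hypothetical action: a real system would call an orchestration API here.
    return f"restart {target}"


def free_disk_space(target):
    # Hypothetical action: describes the clean-up instead of performing it.
    return f"clean temporary files on {target}"


# Assumed mapping from anomaly class labels to routine remediation actions.
PLAYBOOK = {
    "service_crash": restart_service,
    "disk_full": free_disk_space,
}


def remediate(anomaly_class, target):
    """Apply the known routine fix, or escalate unknown anomaly classes."""
    action = PLAYBOOK.get(anomaly_class)
    if action is None:
        return f"escalate {anomaly_class} on {target} to an administrator"
    return action(target)
```

For example, `remediate("disk_full", "node-7")` selects the clean-up action, while an unrecognized class falls through to the escalation path, so only recurring, well-understood problems are handled without an administrator.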

AIOps on Edge Computing Environments

The challenges of Edge and Fog Computing environments create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare or other critical processes. Overseeing the entire system and reacting quickly enough to meet the demanding requirements of Edge Computing surpasses the ability of human experts. In response, a new concept of combining AI methods to operate such complex infrastructures (AIOps) is on the rise.

IT Log Anomaly Detection: Model Generalization at the Inference Stage

This research explores an end-to-end method for detecting anomalies based on logs across a variety of IT environments. The end-to-end process aims to simplify the monitoring and operation of these systems. Our goal is a general solution for anomaly detection in previously unknown systems.

Automation of Cloud Resilience Control

The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resilience is defined as the ability of a cloud platform to recover quickly and continue operating even when there has been a failure. This cooperative research project seeks solutions that establish a link between anomaly detection, recovery procedures, and recommender systems.

Root Cause Localization and Automatic Recovery for Microservices

In this project, we will study the following research questions:

  • Root cause localization: How can the root cause of performance issues in microservices be located?
    (One proposed method: MicroRCA)
  • Automatic recovery: Once the root cause is identified, what action should be taken to recover from the performance degradation with no or minimal SLA violation? (Ongoing)
  • Extension to fog computing: In a geographically distributed, resource-constrained fog computing environment with unreliable networks, how can the approaches developed for the cloud be applied?

Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems

We combine various strategies, from both traditional (e.g., rule mining) and advanced learning methodologies (e.g., Deep Learning), to provide efficient yet accurate models for anomaly detection in distributed IT systems.

Related Publications

2020

Multi-Source Distributed System Data for AI-powered Analytics

Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay and Cardoso, Jorge and Kao, Odej

ESOCC 2020: European Conference On Service-Oriented And Cloud Computing. Springer, To appear. 2020

MicroRCA: Root Cause Localization of Performance Issues in Microservices

Wu, Li and Tordsson, Johan and Elmroth, Erik and Kao, Odej

NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium, 1–9. 2020

Learning more expressive joint distributions in multimodal variational methods

S. Nedelkoski and M. Bogojevski and O. Kao

2020 International Conference on Machine Learning, Optimization, and Data Science, LOD 2020, To appear. 2020

Self-Supervised Log Parsing

S. Nedelkoski and J. Bogatinovski and A. Acker and J. Cardoso and O. Kao

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD 2020, To appear. 2020

Bitflow: An In Situ Stream Processing Framework

Gulenko, Anton and Acker, Alexander and Schmidt, Florian and Becker, Soeren and Kao, Odej

International Conference on Autonomic Computing and Self-Organizing Systems, To appear. 2020

AI-Governance and Levels of Automation for AIOps-supported system administration

Gulenko, Anton and Acker, Alexander and Kao, Odej and Liu, Feng

The 29th International Conference on Computer Communications and Networks, To appear. 2020

Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction

Acker, Alexander and Wittkopp, Thorsten and Nedelkoski, Sasho and Bogatinovski, Jasmin and Kao, Odej

15th Conference on Computer Science and Information Systems, To appear. 2020

2019

Unsupervised Anomaly Alerting for IoT-Gateway Monitoring using Adaptive Thresholds and Half-Space Trees

Wetzig, René and Gulenko, Anton and Schmidt, Florian

2019 Sixth International Conference on Internet of Things: Systems, Management and Security (IOTSMS). IEEE, 161–168. 2019

Multilayer Active Learning for Efficient Learning and Resource Usage in Distributed IoT Architectures

Nedelkoski, Sasho and Thamsen, Lauritz and Verbitskiy, Ilya and Kao, Odej

2019 IEEE International Conference on Edge Computing (EDGE). IEEE, 8-12. 2019

Anomaly Detection from System Tracing Data Using Multimodal Deep Learning

S. Nedelkoski and J. Cardoso and O. Kao

2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 179-186. 2019
