TU Berlin

Department of Telecommunication Systems

Optimization and Fault Tolerance in Distributed Systems

Living in a data-driven, interconnected world undeniably increases the relevance of distributed IT systems. Modern technological discourse is dominated by terms such as Internet of Things (IoT), smart cities, sensor networks, 5G, autonomous transportation, Industry 4.0 and many more, all of which depend on IT. While they promise powerful technologies such as autonomous driving, virtual reality or remote surgery, to name just a few, their increased complexity and distribution break traditional system operation concepts. In general, systems with a growing number of highly interconnected components are hard for human experts to operate alone.

Recent advances in data analysis, captured by the terms Artificial Intelligence (AI) and machine learning (ML), have proven very useful in aiding expert practitioners across many fields. They yield new insights into a range of problem domains and open the door to autonomous solutions. In the context of distributed IT systems, their application is known as AIOps (AI for system operations). The current goal of such systems is to support human experts in operating large, distributed IT systems. Depending on the concrete application field, they focus on diverse objectives such as high availability, fault tolerance, optimization or energy efficiency. Many researchers and global companies have recognized the AIOps approach as highly relevant and are investing considerable effort and resources in advancing the area.

Since 2015, the CIT department has worked with industrial partners (Huawei Technologies Co. Ltd, Deutsche Telekom, Siemens AG, etc.) to establish joint research projects on AIOps solutions for anomaly detection/classification, predictive fault tolerance and auto-remediation. Several years of follow-up work have resulted in many innovative methods, with a proven record of papers published in top-ranked scientific journals and conferences.

Currently, we have several projects that focus on developing novel methods for a variety of system monitoring data, reducing the cold-start problem of deploying models in production environments, and providing insight into the decision process of “black-box” AI models. Our research extends further into the impactful areas of edge clouds, fog computing and IoT. More specifically, the additional complications that arise from the geographic distribution of system components open up a whole new spectrum of problems and research opportunities that we are actively exploring.

Topics of interest

  • Internet of Things, Big Data Analytics
  • Scalability, Availability and Reliability
  • Service Level Agreements, Quality of Service
  • Profiling, Performance Modeling, Testing, Simulations, Testbeds
  • Virtualization, Network-as-a-Service, Infrastructure-as-a-Service
  • AIOps for Edge and Fog Computing Infrastructures
  • Detection and Description of Anomalies
  • Explainable AI in the Context of Root Cause Analysis for Distributed IT Systems
  • Data-driven Remediation and Recovery Strategies

Research Methodology

We mostly adhere to an empirical research methodology, relying primarily on open-source tools and contributing to them. Evaluations are done against relevant open-source systems like OpenStack, Kubernetes or Open vSwitch. We have access to several state-of-the-art infrastructures, including two commodity clusters consisting of 20 and 200 nodes, the faculty's HPC cluster, two OpenStack private clouds, as well as an edge cloud based on OpenStack and a set of Raspberry Pi IoT devices and sensors.

Several extensive collaborations with industrial partners give us the opportunity to work with practical real-world systems on highly relevant solutions. As a result, we gather a wide range of practical experience and tools that are directly applicable in industry, which in turn allows us to raise research questions of practical significance.

Our highly motivated and supportive team creates many opportunities for collaboration and joint research.

Projects

We have several currently active and upcoming projects.

ZerOps - A Self-Healing Platform

The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal of analyzing collected data directly at the data source through streaming-based machine learning (ML) algorithms.
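
To make the idea of in-place, streaming-based analysis more concrete, the following is a minimal sketch of an online anomaly detector that processes one metric sample at a time in constant memory. It is not the ZerOps implementation: the class name, the z-score rule, the threshold of 3.0 and the warm-up length of 10 samples are all assumptions chosen for illustration. The running statistics use Welford's online algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamingZScoreDetector:
    """Flags samples that deviate strongly from the running mean.

    Hypothetical illustration: the class name, threshold and warm-up
    length are assumptions and do not come from ZerOps.
    """
    threshold: float = 3.0   # z-score above which a sample is flagged
    warmup: int = 10         # samples to observe before flagging anything
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0          # running sum of squared deviations (Welford)

    def update(self, value: float) -> bool:
        """Score the sample against past data, then fold it into the state."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's update: constant memory, a single pass over the stream.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


def detect(stream):
    """Return the indices of the samples flagged as anomalous."""
    detector = StreamingZScoreDetector()
    return [i for i, v in enumerate(stream) if detector.update(v)]


# Example: a metric hovering around 10 with one spike at index 20.
samples = [10.0, 10.5, 9.5, 10.2, 9.8] * 4 + [50.0, 10.0]
print(detect(samples))
```

Because the detector keeps only three numbers of state per metric, such a component could run next to the data source rather than in a central analysis cluster. Note that in this sketch flagged samples are still folded into the running statistics; a production detector would likely treat them differently.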

Anomaly Classification and Auto Remediation

This project aims at the automatic remediation of recurring problems. Routine fixes should be taken off system administrators' hands, allowing them to work on meaningful and interesting projects instead.

AIOps on Edge Computing Environments

The challenges of Edge and Fog Computing environments create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare or other critical processes. Overseeing the entire system and reacting within the short intervals that Edge Computing demands surpasses the ability of human experts. In response, a new concept of combining AI methods to operate such complex infrastructures (AIOps) is on the rise.

Generalization of IT Log Anomaly Detection Models at the Inference Stage

This research explores an end-to-end method for detecting anomalies based on logs across a variety of IT environments. The end-to-end process aims to simplify the monitoring and operation of these systems. Our target is a general solution for anomaly detection in unknown systems.

Automation of Cloud Resilience Control

The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resiliency is defined as the ability of a cloud platform to recover quickly and continue operating even when there has been a failure. This cooperative research project seeks solutions that establish a link between anomaly detection, recovery procedures, and recommender systems.

Root Cause Localization and Automatic Recovery for Microservices

In this project, we will study the following research questions:

  • Root cause localization: How can the root cause of performance issues in microservices be located?
    (One proposed method: MicroRCA)
  • Automatic recovery: Once the root cause is identified, what action should be taken to recover from the performance degradation with no or minimal SLA violation? (Ongoing)
  • Extension to fog computing: In a geographically distributed, resource-constrained fog computing environment with an unreliable network, how can the approaches developed for the cloud be applied?

Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems

We combine various strategies, drawing on both traditional methods (e.g., rule mining) and advanced learning methodologies (e.g., Deep Learning), to provide efficient yet accurate models for the problem of anomaly detection in distributed IT systems.

Related Publications

2019

Anomaly Detection from System Tracing Data Using Multimodal Deep Learning

S. Nedelkoski and J. Cardoso and O. Kao

2019 IEEE 12th International Conference on Cloud Computing (CLOUD), 179-186. 2019

Anomaly Detection and Classification using Distributed Tracing and Deep Learning

S. Nedelkoski and J. Cardoso and O. Kao

2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE/ACM, 241-250. 2019

Silent Consensus: Probabilistic Packet Sampling for Lightweight Network Monitoring

Wallschläger, Marcel and Acker, Alexander and Kao, Odej

Computational Science and Its Applications – ICCSA 2019. Springer International Publishing, 241–256. 2019

Anomaly Detection and Levels of Automation for AI-Supported System Administration

Gulenko, Anton and Kao, Odej and Schmidt, Florian

Annual International Symposium on Information Management and Big Data, 1–7. 2019

2018

Unsupervised Anomaly Event Detection for VNF Service Monitoring using Multivariate Online Arima

Schmidt, Florian and Suri-Payer, Florian and Gulenko, Anton and Wallschläger, Marcel and Acker, Alexander and Kao, Odej

2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE. 2018

Online Density Grid Pattern Analysis to Classify Anomalies in Cloud and NFV Systems

Acker, Alexander and Schmidt, Florian and Gulenko, Anton and Kao, Odej

2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE. 2018

Unsupervised Anomaly Event Detection for Cloud Monitoring using Online Arima

Schmidt, Florian and Suri-Payer, Florian and Gulenko, Anton and Wallschläger, Marcel and Acker, Alexander and Kao, Odej

2018 IEEE/ACM International Conference on Utility and Cloud Computing (UCC). IEEE. 2018

A Practical Implementation of In-Band Network Telemetry in Open vSwitch

Gulenko, Anton and Wallschläger, Marcel and Kao, Odej

2018 7th IEEE International Conference on Cloud Networking (CloudNet). IEEE. 2018

Anomaly Detection for Black Box Services in Edge Clouds Using Packet Size Distribution

Wallschläger, Marcel and Gulenko, Anton and Schmidt, Florian and Acker, Alexander and Kao, Odej

2018 7th IEEE International Conference on Cloud Networking (CloudNet). IEEE. 2018

Detecting Anomalous Behavior of Black-Box Services Modeled with Distance-Based Online Clustering

Gulenko, Anton and Schmidt, Florian and Acker, Alexander and Wallschläger, Marcel and Kao, Odej and Liu, Feng

2018 IEEE 11th International Conference on Cloud Computing (CLOUD), 912–915. 2018
