Optimization and Fault Tolerance in Distributed Systems
Living in a data-driven, interconnected world undeniably increases the relevance of distributed IT systems. The modern technological discourse is dominated by terms such as Internet of Things (IoT), smart cities, sensor networks, 5G, autonomous transportation, Industry 4.0 and many more, all of which depend on IT. While these developments promise powerful technologies such as autonomous driving, virtual reality or remote surgery, to name just a few, their increased complexity and distribution break traditional concepts of system operation. In general, systems with an increasing number of highly interconnected components are hard for human experts to operate alone.
Recent advances in data analysis, summarized under the terms Artificial Intelligence (AI) and machine learning (ML), have proven very useful for aiding expert practitioners in various fields. They provide new insights into a range of problem domains and open the opportunity for autonomous solutions. In the context of distributed IT systems, their application is known as AIOps (AI for system operations). The current goal of such systems is to support human experts in operating large, distributed IT systems. Depending on the concrete application field, they focus on diverse objectives such as high availability, fault tolerance, optimization or energy efficiency. Many researchers and global companies have recognized the AIOps approach as highly relevant and are investing considerable effort and resources to advance the area.
Since 2015, the CIT department has worked with industrial partners (Huawei Technologies Co. Ltd, Deutsche Telekom, Siemens AG, etc.) to establish joint research projects on AIOps solutions for anomaly detection and classification, predictive fault tolerance and auto-remediation. Several years of follow-up work have resulted in many innovative methods, documented by papers published in top-ranked scientific journals and conferences.
Currently, we have several projects that focus on developing novel methods for a variety of system monitoring data, reducing the cold-start problem of deploying models in production environments, and providing insight into the decision process of “black-box” AI models. Our research extends further towards the impactful areas of edge clouds, fog computing and IoT. More specifically, the additional complications that arise from the distributed location of system components open a whole new spectrum of problems and research opportunities that we are actively exploring.
Topics of interest
- Internet of Things, Big Data Analytics
- Scalability, Availability and Reliability
- Service Level Agreements, Quality of Service
- Profiling, Performance Modeling, Testing, Simulations, Testbeds
- Virtualization, Network-as-a-Service, Infrastructure-as-a-Service
- AIOps for Edge and Fog Computing Infrastructures
- Detection and Description of Anomalies
- Explainable AI in the Context of Root Cause Analysis for Distributed IT Systems
- Data-driven Remediation and Recovery Strategies
Our current team
- Thorsten [2], Florian [3], Jasmin [4], Alexander [5], Li [6], Sasho [7], Soeren [8]
Research Methodology
We mostly adhere to an empirical research methodology, primarily relying on and contributing to open-source tools. Evaluations are done against relevant open-source systems such as OpenStack, Kubernetes or Open vSwitch. We have access to several state-of-the-art infrastructures, including two commodity clusters consisting of 20 and 200 nodes, the faculty's HPC cluster, two OpenStack private clouds, an edge cloud based on OpenStack, and a set of Raspberry Pi IoT devices and sensors.
Several extensive collaborations with industrial partners provide us with great opportunities to work with practical, real-world systems on highly relevant solutions. As a result, we are able to gather a wide range of practical experience and tools that are directly applicable in industry. This allows us to raise research questions of significant practical relevance.
Our highly motivated and supportive team spirit results in many opportunities for collaborations and joint research.
ZerOps - A Self-Healing Platform
The vision for ZerOps is to provide a scalable platform for monitoring, hierarchical in-place data analytics, and predictive system remediation. The term in-place refers to the explicit design goal to analyze collected data directly at the data source through streaming-based machine learning (ML) algorithms. Learn more... [9]
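To illustrate what streaming-based, in-place analysis can look like, the following is a minimal sketch and not the ZerOps implementation. It assumes a single numeric metric (here a hypothetical CPU load series) and uses a running z-score based on Welford's online algorithm, so that each sample is scored where it is collected and no history has to be stored or forwarded. The class name, threshold and warm-up length are illustrative assumptions.

```python
import math


class StreamingZScoreDetector:
    """Flags samples that deviate strongly from the running mean.

    Uses Welford's online algorithm: memory use is constant and each
    sample is processed exactly once, as it arrives.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # assumed default: 3 standard deviations
        self.warmup = warmup        # assumed default: samples seen before scoring starts
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations from the mean

    def update(self, value):
        """Score `value` against the statistics so far, then learn from it."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford update; note that flagged samples also update the statistics.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


# Hypothetical CPU load samples from one node; the seventh value is a spike.
detector = StreamingZScoreDetector(threshold=3.0, warmup=5)
cpu_load = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48, 0.95, 0.51]
flags = [detector.update(x) for x in cpu_load]
```

In this sketch only the spike is flagged. A production detector would likely exclude flagged samples from the statistics or use a decaying window, which is a design choice left open here.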
Anomaly Classification and Auto Remediation
This project aims at the remediation of recurring problems: the routine tasks of fixing them should be taken off system administrators' hands, allowing them to work on meaningful and interesting projects instead. Learn more... [10]
AIOps on Edge Computing Environments
The challenges of Edge and Fog Computing environments create a paradoxical situation: a vulnerable infrastructure has a decisive impact on our everyday life, as it delivers crucial data for, e.g., autonomous driving, connected healthcare or other critical processes. Managing this complexity, overseeing the entire system and reacting within short intervals to meet the high requirements placed on Edge Computing surpasses the ability of human experts. In response, a new concept of combining AI methods to operate such complex infrastructures (AIOps) is on the rise. Learn more... [11]
IT Log anomaly detection model generalization on inference stage
This research seeks to explore an end-to-end method for detecting anomalies based on logs within a variety of IT environments. This end-to-end process aims at the simplification of the monitoring and operation of the systems. Our target is to realize a general solution for anomaly detection in unknown systems. Learn more... [12]
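As a point of reference for log-based detection, the sketch below shows a very simple baseline and not the project's method: log lines are reduced to coarse templates by masking numbers and hexadecimal values, and a line is flagged when its template was never seen during normal operation. The function and class names, the masking rules and the example log lines are all illustrative assumptions.

```python
import re


def to_template(line):
    """Reduce a log line to a coarse template by masking variable parts."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)   # hexadecimal values first
    line = re.sub(r"\d+(\.\d+)*", "<NUM>", line)      # integers, decimals, IPv4 addresses
    return line.strip()


class UnseenTemplateDetector:
    """Flags log lines whose template did not occur in the training logs."""

    def __init__(self):
        self.known = set()

    def fit(self, normal_lines):
        for line in normal_lines:
            self.known.add(to_template(line))

    def is_anomalous(self, line):
        return to_template(line) not in self.known


# Hypothetical log lines recorded during normal operation.
detector = UnseenTemplateDetector()
detector.fit([
    "Connection from 10.0.0.5 port 51234 accepted",
    "Request 17 served in 35 ms",
])

known_pattern = detector.is_anomalous("Connection from 10.0.0.9 port 40022 accepted")
new_pattern = detector.is_anomalous("Disk /dev/sda1 failed with error 0x1f")
```

Here the first query matches a known template and the second does not. Such a baseline is tied to the templates of one system, which is exactly the limitation a general solution for unknown systems has to overcome.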
Automation of Cloud Resilience Control
The resilience of cloud platforms (e.g., Deutsche Telekom OTC, Amazon AWS, and Microsoft Azure) is becoming increasingly relevant as society relies more and more on complex software systems. Resilience is defined as the ability of a cloud platform to recover quickly and continue operating even after a failure has occurred. This cooperative research project seeks solutions that can establish a link between anomaly detection, recovery procedures, and recommender systems. Learn more... [13]
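One simple way to picture the link between detected anomalies, recovery procedures and recommendations is a success-rate ranking, sketched below under stated assumptions. It is not the project's approach: the anomaly types, action names and the Laplace-smoothed scoring are hypothetical choices made for illustration only.

```python
from collections import defaultdict


class RecoveryRecommender:
    """Ranks recovery actions by their smoothed historical success rate."""

    def __init__(self):
        # (anomaly_type, action) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, anomaly_type, action, success):
        """Store the outcome of one recovery attempt."""
        entry = self.stats[(anomaly_type, action)]
        entry[1] += 1
        if success:
            entry[0] += 1

    def recommend(self, anomaly_type, top_k=3):
        """Return up to `top_k` actions for the given anomaly type, best first."""
        scored = []
        for (a_type, action), (ok, total) in self.stats.items():
            if a_type == anomaly_type:
                # Laplace smoothing: rarely tried actions are not over-trusted.
                scored.append(((ok + 1) / (total + 2), action))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [action for _, action in scored[:top_k]]


# Hypothetical history of recovery attempts.
recommender = RecoveryRecommender()
for _ in range(3):
    recommender.record("memory_leak", "restart_service", True)
recommender.record("memory_leak", "scale_out", True)
recommender.record("memory_leak", "scale_out", False)
recommender.record("disk_full", "clean_tmp", True)

suggestions = recommender.recommend("memory_leak")
unknown = recommender.recommend("network_partition")
```

An anomaly type without recorded history yields an empty list, which in practice would be the point at which a human operator is consulted.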
Root Cause Localization and Automatic Recovery for Microservices
In this project, we will study the following research questions:
- Root cause localization: How can the root cause of performance issues in microservices be located? (One proposed method: MicroRCA [14])
- Automatic recovery: Once the root cause is identified, what action should be taken to recover from the performance degradation with no or minimal SLA violations? (Ongoing)
- Extension to fog computing: In a geographically distributed, resource-constrained fog computing environment with unreliable networks, how can the approaches developed for the cloud be applied?
Learn more... [15]
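The root cause localization question can be made concrete with a small graph-based sketch. The code below is a generic personalized-PageRank-style ranking over a service call graph and is not a reproduction of MicroRCA. It assumes that per-service anomaly scores are already available and that problems tend to propagate from callees to their callers, so a random walk that follows call edges accumulates weight at likely culprits. The service names, scores and parameters are hypothetical.

```python
def rank_root_causes(calls, anomaly_scores, damping=0.85, iterations=50):
    """Rank services as root cause candidates.

    calls: dict mapping a service to the list of services it calls.
    anomaly_scores: dict mapping a service to a non-negative anomaly score.
    """
    services = sorted(
        set(calls) | {c for cs in calls.values() for c in cs} | set(anomaly_scores)
    )
    total = sum(anomaly_scores.values())
    if total <= 0:
        raise ValueError("at least one service needs a positive anomaly score")

    # The walk restarts at services in proportion to their anomaly score.
    restart = {s: anomaly_scores.get(s, 0.0) / total for s in services}
    rank = dict(restart)

    for _ in range(iterations):
        nxt = {s: (1.0 - damping) * restart[s] for s in services}
        for s in services:
            callees = calls.get(s, [])
            if callees:
                # Follow call edges: a caller passes its weight to its callees.
                share = damping * rank[s] / len(callees)
                for c in callees:
                    nxt[c] += share
            else:
                # Leaf service: return its weight to the restart distribution.
                for t in services:
                    nxt[t] += damping * rank[s] * restart[t]
        rank = nxt

    return sorted(services, key=lambda s: (-rank[s], s))


# Hypothetical shop application in which the database is degraded.
calls = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}
anomaly_scores = {"frontend": 0.6, "cart": 0.7, "catalog": 0.1, "db": 0.9}
ranking = rank_root_causes(calls, anomaly_scores)
```

In this example the database ends up at the top of the ranking because both anomalous call paths lead to it. Applying the same idea in a fog environment would additionally have to cope with incomplete or delayed views of the call graph.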
Artificial Intelligence for Anomaly Detection and Root Cause Analysis in Cloud Systems
We combine various strategies, from both traditional (e.g., rule mining) and advanced learning methodologies (e.g., deep learning), to provide efficient yet accurate models for the problem of anomaly detection in distributed IT systems. Learn more... [16]
Contact
Odej Kao
+49 30 314-25154 (Secretariat)
Room TEL 1206/7
e-mail query [48]
Contact
Alexander Acker
+49 (30) 314-25397
Room TEL 1204
e-mail query [49]