MA: Adaptive Resource Management for Stream Processing Jobs
Distributed Stream Processing (DSP) systems are critical to processing vast amounts of data in real time. In these systems, events traverse a graph of streaming operators that extract valuable information. In many scenarios this information is most valuable at the time the data arrives, so systems must deliver a predictable level of performance. Examples include IoT data processing, click stream analytics, network monitoring, financial fraud detection, spam filtering, and news processing. To process such large data streams, DSP systems such as Storm, Spark, and Flink have been introduced. They allow the deployment of analytics pipelines that use the processing power of a cluster of commodity nodes. Applications developed within these frameworks are, in principle, required to operate indefinitely on an unbounded stream of continuous data, in an environment where partial failures are to be expected as the applications scale. Consequently, DSP systems feature high availability modes, implement fault tolerance mechanisms by default, and expose a rich set of continuously evolving features.
However, jobs still need to be configured. Tuning configuration parameters and provisioning adequate resources are left to the user, and even experts often do not fully understand the system and environmental factors that influence runtime behavior. An under-provisioned system fails to meet the minimum performance requirements for efficient distributed processing, while an over-provisioned system wastes valuable resources and energy and incurs unnecessary costs. Finding good configurations in a continuously changing environment is hard, and the ability to adapt these configurations dynamically is not a common feature of modern DSP systems. New solutions are needed here that take advantage of the latest developments in computer science.
Additionally, what happens when things go wrong? Fault tolerance is the ability of a running system to continue operating in the presence of partial failures, such as machine hardware failures, network failures, and transient program failures. Ensuring the highest state update guarantees as data flows through a pipeline of interconnected real-time streaming systems is a hard problem, and at the same time a vital and resource-intensive operation. There is room to improve the performance of existing fault tolerance mechanisms by enabling them to adapt to changing runtime conditions and to take Quality of Service (QoS) constraints into account. Moreover, as we become more dependent on IoT to meet the demands of our ever-growing societies, increasingly heterogeneous environments will replace the traditional homogeneous data center. Examples include connected devices that are dispersed over large geographical areas and built from hardware with vastly different processing capabilities. Here, too, new solutions are needed to reduce the impact of failures.
Thesis topics may focus on the following: adaptive fault tolerance, runtime optimization, automatic parameter tuning, monitoring, modeling and runtime prediction, profiling, resource allocation, and fault detection/identification. A thesis will entail designing a general method, implementing a prototype in the context of existing open-source systems, and experimentally evaluating the prototype with multiple benchmark jobs and large test data on one of our commodity clusters.
If you have an interest in these topics, feel free to email me with some background information about yourself so that we can quickly identify a fitting thesis topic together.