MA: Adaptive Resource Management for Data-Parallel Batch Jobs
Many organizations need to analyze increasingly large datasets, such as the data collected by thousands of sensors. Even with the parallel processing capacities that single nodes provide today, the size of some datasets requires the resources of many nodes. Consequently, distributed systems have been developed to manage and process large datasets on clusters of commodity resources. A popular class of such distributed systems is distributed dataflow systems like Flink, Spark, and Beam. These systems offer high-level programming abstractions revolving around operators as well as efficient, fault-tolerant distributed runtime environments. The runtimes provide, for example, effective data partitioning, data-parallel operator implementations, task distribution and monitoring, and data transfer and communication among workers.

Arguably, distributed dataflow systems make it considerably easier for users to develop data-parallel programs that use large sets of cluster resources. However, users still need to select adequate resources for their jobs and carefully configure the systems for efficient distributed processing. Yet even expert users often do not fully understand system and workload dynamics, since many factors determine the runtime behavior (e.g. programs, datasets, systems, configurations, architectures). In fact, users currently overprovision heavily to make sure their jobs meet minimal performance requirements. At the same time, even systems like Flink and Spark do not scale without overheads, and the scale-out behavior of particular jobs is often not straightforward. Moreover, the usefulness of additional compute resources is ultimately limited by the rate at which data is ingested. Significant overprovisioning therefore leads to low resource utilization and, thus, to unnecessary costs and energy consumption.
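The claim that scale-out eventually yields diminishing returns can be made concrete with a small sketch. The parametric model below (a fixed serial part, a perfectly parallelizable part, and a per-node coordination cost) and all of its numbers are illustrative assumptions, not measurements of any particular system:

```python
# Illustrative sketch of why scale-out has diminishing returns, assuming a
# hypothetical parametric runtime model (not taken from any specific system):
#   runtime(n) = serial + parallel / n + coord * n
# where n is the number of nodes, `serial` is non-parallelizable work,
# `parallel` is perfectly parallelizable work, and `coord` models
# per-node coordination overhead. All parameter values are made up.

def runtime(n, serial=60.0, parallel=3600.0, coord=2.0):
    """Predicted job runtime in seconds on n nodes (illustrative numbers)."""
    return serial + parallel / n + coord * n

def speedup(n, **params):
    """Speedup over a single node under the same (assumed) model."""
    return runtime(1, **params) / runtime(n, **params)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} nodes: runtime {runtime(n):7.1f}s, speedup {speedup(n):5.2f}x")
```

With these assumed parameters, speedup flattens well below linear, and beyond roughly 40 nodes the modeled runtime even increases again because coordination overhead outweighs the remaining parallel gains, which is one reason heavy overprovisioning wastes resources.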
Instead of having users essentially guess adequate sets of resources and system configurations, resource management systems should effectively support users with these tasks. That is, resource managers should automatically tune resource allocations, job scheduling, and data placement based on models of the workload and on user-provided performance constraints. Such models can be learned from a cluster's execution history, from dedicated profiling runs, or from a combination of both. Ultimately, the goal is to let users concentrate fully on their programs while the systems make more informed resource management decisions.
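One way to picture this modeling idea is a least-squares fit of a simple parametric runtime model to past runs, which can then answer questions like "what is the smallest scale-out that meets a given runtime target?". The feature set [1, 1/n, n], the sample history, and the helper names below are illustrative assumptions of this sketch, not the method of any specific resource manager:

```python
# Hedged sketch: learn a runtime model from a cluster's execution history
# and use it to choose the smallest resource allocation that satisfies a
# user-provided runtime constraint. Model form, data, and names are
# illustrative assumptions.
import numpy as np

# Hypothetical execution history: (number of nodes, observed runtime in seconds).
history = [(2, 1904.0), (4, 1008.0), (8, 566.0), (16, 357.0), (32, 276.5)]

nodes = np.array([n for n, _ in history], dtype=float)
times = np.array([t for _, t in history])

# Least-squares fit of runtime(n) ≈ a + b/n + c*n over features [1, 1/n, n].
X = np.column_stack([np.ones_like(nodes), 1.0 / nodes, nodes])
coef, *_ = np.linalg.lstsq(X, times, rcond=None)

def predict(n):
    """Predicted runtime in seconds on n nodes under the fitted model."""
    a, b, c = coef
    return a + b / n + c * n

def smallest_scaleout(target_seconds, max_nodes=64):
    """Fewest nodes predicted to meet the target runtime, or None."""
    for n in range(1, max_nodes + 1):
        if predict(n) <= target_seconds:
            return n
    return None
```

For instance, `smallest_scaleout(400)` would return the minimal node count whose predicted runtime stays within a 400-second constraint, instead of requiring the user to guess and overprovision.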
Concrete theses in this area may focus on the following topics: monitoring, modeling and runtime prediction, model training, profiling, resource allocation, scheduling and placement, runtime adjustments, and automatic system configuration. All theses will entail designing a general method, implementing a prototype in the context of existing open source systems, and experimentally evaluating the prototype with multiple benchmark jobs and large test data using one of our commodity clusters.
If this sounds interesting to you, please send me an email with a little background information about yourself, so we can quickly identify a fitting thesis topic together.