MA: Adaptive Resource Management for Distributed Machine Learning
In recent years, machine learning applications have gained traction, and models need to process increasingly large amounts of training data. Commonly, machine learning models are trained in a distributed manner, as facilitated by popular frameworks and libraries like TensorFlow and Apache MXNet. Many of the difficulties of parallel programming are taken care of by the frameworks, which aim to provide simple interfaces to the user. However, a challenge that remains for the user is the selection of suitable computational resources. A well-chosen configuration can lead to significant cost savings over manual selections, which are often not optimal.
Traditional distributed dataflow systems like Apache Flink, Apache Spark and MapReduce cover a wider set of applications and typically run on CPUs. A cluster configuration for those types of applications then consists of a node type and a scale-out, i.e. the number of nodes. The nodes, which can also be virtual machines, should be chosen so that resources like memory, compute and I/O fit the requirements of the workload. Allocating resources for such applications has been the subject of many research efforts. Scale-outs are often chosen based on performance models, with a certain runtime target in mind. In some cases, resources can even be re-allocated at runtime in accordance with projected and desired execution times. Co-locating workloads can also increase overall performance and has likewise been the subject of research.
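As an illustration of choosing a scale-out from a performance model, the following is a minimal sketch and not a method prescribed by this topic. It assumes a deliberately simple runtime model t(n) = a + b / n (a fixed serial part plus a part that parallelizes perfectly), fitted by least squares to a few profiling runs. The function names and the sample measurements are hypothetical.

```python
def fit_runtime_model(samples):
    """Least-squares fit of t(n) = a + b / n to (scale_out, runtime_s) pairs.

    The model form is an assumption made for this sketch; real workloads
    usually need additional terms, e.g. for communication overhead.
    """
    xs = [1.0 / n for n, _ in samples]
    ys = [t for _, t in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need measurements at two or more distinct scale-outs")
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    a = mean_y - b * mean_x
    return a, b


def smallest_scale_out(model, target_runtime_s, max_nodes):
    """Return the smallest node count whose predicted runtime meets the target.

    Returns None if no scale-out up to max_nodes is predicted to meet it.
    """
    a, b = model
    for n in range(1, max_nodes + 1):
        if a + b / n <= target_runtime_s:
            return n
    return None


# Hypothetical profiling runs: (number of nodes, measured runtime in seconds).
samples = [(2, 620.0), (4, 320.0), (8, 170.0)]
model = fit_runtime_model(samples)
chosen = smallest_scale_out(model, target_runtime_s=200.0, max_nodes=32)
print("fitted (a, b):", model, "chosen scale-out:", chosen)
```

Scanning candidate scale-outs from small to large returns the cheapest configuration predicted to meet the target, under the assumption that cost grows with the number of nodes.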
With machine learning systems, the cards are being reshuffled. For many of the most common machine learning applications, the computational effort is shifted to GPUs or even more specialized hardware such as TPUs and other ASICs. Additionally, communication patterns in distributed machine learning applications, as facilitated by e.g. parameter servers, differ from those of more traditional, MapReduce-like applications.
You will have the opportunity to develop a solution for the efficient and/or adaptive allocation of resources for applications built on machine learning frameworks. Concrete theses in this area may focus on the following topics: monitoring, modeling and runtime prediction, profiling, resource allocation, scheduling and placement, runtime adjustments, and automatic system configuration. All theses will entail designing a general method, implementing a prototype in the context of existing open source systems, and experimentally evaluating the prototype with multiple benchmark jobs and large test data, using university GPU cluster resources and/or public cloud resources.
If this sounds interesting to you, please send me an email with a little bit of background information on yourself, so we can quickly identify a fitting thesis topic together.