Adaptive Fault Tolerance for Distributed Dataflow Systems
Many organizations need to analyze increasingly large datasets, such as data collected by thousands of sensors. Even with the parallel processing capacities that single nodes provide today, the size of some datasets requires the resources of many nodes. Consequently, distributed systems have been developed to manage and process large datasets on clusters of commodity resources. A popular class of such distributed systems is distributed dataflow systems like MapReduce, Spark, and Flink. These systems provide effective data partitioning, data-parallel operator implementations, task distribution and monitoring, efficient data transfer and communication among workers, and fault tolerance. Different strategies for handling failures are currently used: shuffling the intermediate results between operations through file systems (MapReduce), storing the lineage of partially processed data (Spark), or fully re-trying failed jobs (Flink). These strategies offer different trade-offs between additional overhead and the amount of work that must be redone when failures occur. Fault tolerance strategies can also be fine-tuned (e.g. snapshotting intermediate results every X iterations of a job), and multiple strategies can be combined. Ideally, the work saved by not having to repeat computation after node failures outweighs, on average, the additional overhead of the fault tolerance strategy.
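The trade-off between snapshotting overhead and repeated work can be illustrated with a minimal cost model. The sketch below is a simplification under strong assumptions (a fixed per-snapshot cost `c`, exponentially distributed failures with mean time between failures `m`, and on average half an interval of lost work per failure); the interval minimizing this overhead follows Young's well-known first-order approximation:

```python
import math

def expected_overhead(interval, checkpoint_cost, mtbf):
    """Expected overhead per unit of useful work under a simple first-order
    model: the snapshot cost amortized over the interval, plus the expected
    rework (half an interval, on average) per failure."""
    return checkpoint_cost / interval + (interval / 2) / mtbf

def youngs_interval(checkpoint_cost, mtbf):
    """Young's approximation for the interval minimizing the overhead above."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

# Hypothetical numbers for illustration: 30 s per snapshot,
# one failure per hour on average.
c, m = 30.0, 3600.0
opt = youngs_interval(c, m)

# Under this model, the analytic optimum beats nearby intervals.
assert expected_overhead(opt, c, m) <= expected_overhead(opt * 2, c, m)
assert expected_overhead(opt, c, m) <= expected_overhead(opt / 2, c, m)
```

Even this toy model makes the central point of the thesis visible: the best interval depends on quantities (failure rate, snapshot cost) that vary per job and per cluster, which is why a single static configuration is rarely optimal.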
Optimally choosing a fault tolerance policy together with an appropriate parametrization requires an in-depth understanding of the written job, the input data, the underlying distributed processing system, and the specifics of the cluster environment. Usually, system operators have only limited insight into the former aspects, while users are typically less familiar with the latter. This renders the procedure of tailoring an individual fault tolerance policy and parametrization to the requirements and properties of each job practically infeasible. Instead, distributed dataflow systems currently apply either no fault tolerance at all or a static “one size fits all” configuration for all jobs.
An alternative to the in-depth understanding and hand-tuning of each job is potentially provided by the broad field of machine learning, which allows the distillation of general patterns from datasets of adequate size. Access to datasets containing information on a variety of jobs together with their outcomes (success or failure) makes it possible to search for general patterns that enable an individual fault tolerance policy selection and parametrization.
Two questions are of interest for this thesis. First, it should be analyzed whether it is possible to discover general patterns within the configuration data (requested resources, operators, parallelism, input size, etc.) of submitted jobs that indicate a higher probability of failure. Such knowledge would make it possible to automatically configure fault tolerance policies for each job before it is executed. Second, an online analysis of the actually consumed resources and other runtime information can be utilized to make continuous predictions about the outcome of the job (success or failure). This allows fault tolerance configurations to be adjusted dynamically at runtime in order to react to imminent failures, or their absence, and thus optimize for the respective outcome.
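The first question, predicting failure risk from configuration data before execution, can be sketched as a standard binary classification task. The following is a minimal illustration on synthetic data; the feature set (input size, parallelism, memory per task) and the failure rule are hypothetical assumptions for demonstration only, and the plain gradient-descent logistic regression stands in for whatever model the thesis ultimately selects:

```python
import math
import random

def train_logistic(samples, labels, lr=0.1, epochs=300):
    """Plain gradient-descent logistic regression (no external libraries)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted failure probability
            g = p - y                        # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Failure probability for one job configuration vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic job records: [normalized input size, parallelism, memory per task].
# Hypothetical assumption: large inputs with little memory tend to fail.
random.seed(0)
jobs, outcomes = [], []
for _ in range(200):
    size, par, mem = random.random(), random.random(), random.random()
    jobs.append([size, par, mem])
    outcomes.append(1 if size - mem > 0.2 else 0)

w, b = train_logistic(jobs, outcomes)
risky = predict(w, b, [0.9, 0.5, 0.1])  # large input, little memory per task
safe = predict(w, b, [0.1, 0.5, 0.9])   # small input, ample memory per task
assert risky > safe
```

A model of this kind would be trained offline on historical traces; its per-job prediction cost is a single inner product, which is negligible next to the overhead of any fault tolerance mechanism it configures.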
Specifically, this thesis will analyze three publicly available datasets from Google Cloud, CMU OpenCloud, and Alibaba. First, the applicability of these datasets to the use cases described above should be verified. Next, concrete modelling approaches should be designed and evaluated based on the selected dataset(s). In doing so, the solutions also need to consider the overhead of running predictions and compare it both to the time saved in recovering from failures and to the benefits of a dynamic parametrization over a static approach. A literature review of current adaptive fault tolerance approaches for distributed dataflow systems is also required.