Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. Making timely decisions based on these results therefore depends on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance, and the ability to recover automatically from partial failures, by implementing checkpoint and rollback recovery. However, because partial failures are likely in these distributed environments and the workloads that jobs are expected to handle vary, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach which uses the parallel processing capabilities of virtual cloud automation technologies to automatically optimize fault tolerance configurations in Distributed Stream Processing jobs at runtime. Our approach employs three successive phases that borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented a prototype of Khaos for Apache Flink and demonstrate its usefulness experimentally.
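As a rough illustration of the third phase, the sketch below picks a checkpoint interval from the outcomes of failure experiments. It is not the Khaos implementation: the names `ExperimentResult` and `choose_interval`, the two Quality of Service limits, and all measurement values are hypothetical and chosen only for this example. It assumes the common trade-off that longer checkpoint intervals lower steady-state overhead but lengthen recovery, so it selects the longest interval that still satisfies both limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ExperimentResult:
    """Outcome of one hypothetical failure-injection experiment."""
    checkpoint_interval_s: float  # checkpoint interval under test
    recovery_time_s: float        # observed time to recover after the injected failure
    steady_latency_ms: float      # observed latency while no failure is active


def choose_interval(
    results: List[ExperimentResult],
    max_recovery_s: float,
    max_latency_ms: float,
) -> Optional[float]:
    """Return the longest tested interval that meets both limits, or None."""
    feasible = [
        r for r in results
        if r.recovery_time_s <= max_recovery_s
        and r.steady_latency_ms <= max_latency_ms
    ]
    if not feasible:
        return None  # no tested configuration satisfies the constraints
    best = max(feasible, key=lambda r: r.checkpoint_interval_s)
    return best.checkpoint_interval_s


# Made-up measurements, for illustration only.
results = [
    ExperimentResult(10.0, 20.0, 180.0),   # fast recovery, high overhead
    ExperimentResult(30.0, 35.0, 120.0),
    ExperimentResult(60.0, 70.0, 100.0),
    ExperimentResult(120.0, 150.0, 95.0),  # low overhead, slow recovery
]

selected = choose_interval(results, max_recovery_s=90.0, max_latency_ms=150.0)
print(selected)
```

In a running system this selection would be repeated as workloads change, with fresh experiment results replacing the static list used here.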