LOG.io: Unified Rollback Recovery and Data Lineage Capture for Distributed Data Pipelines

This paper introduces LOG.io, a comprehensive solution for correct rollback recovery and fine-grained data lineage capture in distributed data pipelines. It is tailored to serverless, scalable architectures and uses a log-based rollback recovery protocol. LOG.io supports a general programming model that accommodates non-deterministic operators, interactions with external systems, and arbitrary custom code. It is non-blocking: a failed operator recovers independently, without interrupting other active operators, so the pipeline can still exploit data parallelization. It also supports dynamic scaling of operators during pipeline execution. Performance evaluations, conducted within the SAP Data Intelligence system, compare LOG.io with the Asynchronous Barrier Snapshotting (ABS) protocol, originally implemented in Flink. Our experiments show that when a data pipeline contains straggler operators and event throughput is moderate (e.g., 1 event every 100 ms), LOG.io performs as well as ABS during normal processing and outperforms ABS during recovery. Otherwise, ABS performs better than LOG.io for both normal processing and recovery. However, we show that in these cases data parallelization can largely reduce the overhead of LOG.io, while ABS does not improve. Finally, we show that the overhead of data lineage capture, at the granularity of the event and between any two operators in a pipeline, is marginal: below 1.5% in all our experiments.

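To make the idea of log-based rollback recovery for a non-deterministic operator more concrete, the following is a minimal sketch of the general technique, not LOG.io's actual protocol. All names here (`EventLog`, `LoggedOperator`, `noisy_double`) are hypothetical, and the in-memory dictionary merely stands in for a durable log that would survive an operator failure. The operator records each output against its input event before emitting it; after a failure, a fresh operator instance replays logged outputs instead of recomputing them, so downstream consumers see the same values even though the function itself is non-deterministic.

```python
import random
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical stand-in for a durable log that outlives operator instances."""
    records: dict = field(default_factory=dict)  # input event id -> logged output

    def append(self, event_id, output):
        self.records[event_id] = output

    def contains(self, event_id):
        return event_id in self.records


class LoggedOperator:
    """Wraps a possibly non-deterministic function with output logging."""

    def __init__(self, fn, log):
        self.fn = fn
        self.log = log

    def process(self, event_id, payload):
        if self.log.contains(event_id):
            # Recovery path: reuse the logged output rather than recomputing,
            # since recomputation could yield a different value.
            return self.log.records[event_id]
        output = self.fn(payload)
        # Log the output before it is handed downstream.
        self.log.append(event_id, output)
        return output


def noisy_double(x):
    """A non-deterministic operator body: the result differs on every call."""
    return 2 * x + random.random()


log = EventLog()
op = LoggedOperator(noisy_double, log)
first_run = [op.process(i, i) for i in range(3)]

# Simulate a failure: the operator instance is lost, but the log survives.
recovered = LoggedOperator(noisy_double, log)
replayed = [recovered.process(i, i) for i in range(3)]

# An event that was never processed before the failure is computed normally.
fresh = recovered.process(3, 3)
```

In this sketch the same log that drives recovery also maps each input event to the output it produced, which hints at why recovery and event-level lineage can share one mechanism; how LOG.io actually structures its log is not specified in the abstract.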