Real-time data processing applications with low-latency requirements have made stream processing systems increasingly popular. While such systems offer convenient APIs that achieve data parallelism automatically, they offer limited support for computations that require synchronization between parallel nodes. In this paper, we propose *dependency-guided synchronization (DGS)*, an alternative programming model and stream processing API for stateful streaming computations with complex synchronization requirements. In the proposed model, the input is viewed as partially ordered, and the program consists of a set of parallelization constructs that are applied to decompose the partial order and process events independently. Our API maps to an execution model called *synchronization plans*, which supports synchronization between parallel nodes. Our evaluation shows that the APIs offered by two widely used systems, Flink and Timely Dataflow, cannot suitably expose parallelism in some representative applications. In contrast, DGS enables implementations with scalable performance; the resulting synchronization plans offer throughput improvements when implemented manually in existing systems; and the programming overhead is small compared to writing sequential code.
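To make the idea of a partially ordered input concrete, the following is a minimal single-threaded sketch in Python. It is not the DGS API described in the paper: the event tags (`inc`, `barrier`), the `depends` relation, and the routing by `key % workers` are all hypothetical names chosen for illustration. It assumes a workload in which increments on different keys are independent, while a barrier event depends on every other event and therefore forces the parallel states to be joined.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    tag: str    # "inc" or "barrier" (hypothetical tags, not from the paper)
    key: int    # ignored for barrier events
    value: int  # ignored for barrier events


def depends(a: Event, b: Event) -> bool:
    """Hypothetical symmetric dependency relation between two events.

    Barriers depend on everything; increments depend only on
    increments with the same key.
    """
    if a.tag == "barrier" or b.tag == "barrier":
        return True
    return a.key == b.key


def routing_respects_dependencies(events: List[Event], workers: int) -> bool:
    """Check that any two dependent increments are routed to the same worker."""
    incs = [e for e in events if e.tag == "inc"]
    return all(
        a.key % workers == b.key % workers
        for a in incs
        for b in incs
        if depends(a, b)
    )


def run_sequential(events: List[Event]) -> List[int]:
    """Reference semantics: one state, events processed in arrival order."""
    total, out = 0, []
    for e in events:
        if e.tag == "inc":
            total += e.value
        else:
            out.append(total)  # a barrier emits the running total
    return out


def run_partitioned(events: List[Event], workers: int = 2) -> List[int]:
    """Simulate parallel workers, each holding a partial state.

    Independent increments update only their own worker's state;
    a barrier synchronizes by joining all partial states.
    """
    partial = [0] * workers
    out = []
    for e in events:
        if e.tag == "inc":
            partial[e.key % workers] += e.value  # no synchronization needed
        else:
            out.append(sum(partial))  # join point: all workers synchronize
    return out


events = [
    Event("inc", 1, 5),
    Event("inc", 2, 3),
    Event("barrier", 0, 0),
    Event("inc", 1, 2),
    Event("barrier", 0, 0),
]
```

Under these assumptions, `run_partitioned(events)` produces the same outputs as `run_sequential(events)`, because only the barrier events require the partial states to be combined. A real system would run the workers concurrently and would need a protocol for the join, which is the role the abstract assigns to synchronization plans.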