Stream processing is extensively used in the IoT-to-Cloud spectrum to distill information from continuous streams of data. Streaming applications usually run in dedicated Stream Processing Engines (SPEs) that adopt the DataFlow model, which defines such applications as graphs of operators that, step by step, transform data into the desired results. As operators can be deployed and executed independently, the DataFlow model supports parallelism and distribution, thus making streaming applications scalable. Today, we witness an abundance of SPEs, each with its own set of operators. In this context, understanding how operators' semantics overlap within and across SPEs, and thus which SPEs can support a given application, is not trivial. We tackle this problem by formally showing that common operators of SPEs can be expressed as compositions of a single, minimalistic Aggregate operator, thus showing that any framework able to run compositions of such an operator can run applications defined for state-of-the-art SPEs. The Aggregate operator relies only on core concepts of the DataFlow model, such as data partitioning by key and time-based windows, and outputs at most one value for each window it analyzes. Alongside our formal argument, we empirically assess how an SPE that relies only on such an operator compares with an SPE offering operator-specific implementations, and we study the performance impact of a more expressive Aggregate operator obtained by relaxing the constraint of outputting at most one value per window. The existence of such a common denominator not only implies the portability of operators within and across SPEs but also defines a concise set of requirements for other data processing frameworks to support streaming applications.
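To make the idea of a single Aggregate operator more concrete, the following is a minimal sketch, not the formalization used in this work. It assumes a finite list of `(timestamp, payload)` tuples with integer timestamps, tumbling (non-overlapping) windows, and an output timestamp equal to the window start. The names `aggregate`, `filter_via_aggregate`, `key_fn`, and `fn` are hypothetical and chosen only for illustration. The Filter composition additionally assumes that timestamps are unique, so that a window of size 1 holds exactly one tuple.

```python
from collections import defaultdict


def aggregate(stream, key_fn, window_size, fn):
    """Keyed tumbling-window aggregate (illustrative sketch).

    stream      : iterable of (timestamp, payload), integer timestamps
    key_fn      : payload -> hashable key (data partitioning by key)
    window_size : length of each tumbling window
    fn          : (key, items) -> result or None; returning None means
                  "no output", so each window yields at most one value
    """
    windows = defaultdict(list)
    for ts, payload in stream:
        start = (ts // window_size) * window_size  # window the tuple falls in
        windows[(start, key_fn(payload))].append((ts, payload))

    out = []
    # Order by window start, then by key representation, for determinism.
    for (start, key), items in sorted(
        windows.items(), key=lambda kv: (kv[0][0], repr(kv[0][1]))
    ):
        result = fn(key, items)
        if result is not None:
            # Simplification: the output carries the window start as timestamp.
            out.append((start, result))
    return out


def filter_via_aggregate(stream, predicate):
    """Filter expressed as an Aggregate with window size 1.

    Assumes unique timestamps, so every window holds exactly one tuple.
    """
    def keep_or_drop(_key, items):
        payload = items[0][1]
        return payload if predicate(payload) else None

    return aggregate(stream, key_fn=lambda p: None, window_size=1, fn=keep_or_drop)


# Payloads are (sensor_id, reading) pairs.
stream = [(0, ("a", 3)), (1, ("b", 5)), (2, ("a", 4)), (5, ("a", 10))]

# Per-sensor sum over windows of size 4.
sums = aggregate(
    stream,
    key_fn=lambda p: p[0],
    window_size=4,
    fn=lambda key, items: (key, sum(p[1] for _, p in items)),
)
print(sums)  # [(0, ('a', 7)), (0, ('b', 5)), (4, ('a', 10))]

# Keep only readings greater than 3.
print(filter_via_aggregate(stream, lambda p: p[1] > 3))
# [(1, ('b', 5)), (2, ('a', 4)), (5, ('a', 10))]
```

Returning `None` from `fn` is how this sketch models the "at most one value per window" constraint: a window either contributes a single output or nothing. Relaxing that constraint would correspond to letting `fn` return a list of results instead.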