A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of a running job, for example to fix a loophole in an operator or to replace an operator's machine learning model with a cheaper one to handle a sudden increase in the data-ingestion rate. Many systems have recently started supporting runtime reconfigurations, which allow this type of change on the fly without killing and restarting the execution. Although reconfiguration delay is critical to performance, existing systems use epochs to perform runtime reconfigurations, which can cause a long delay. In this paper, we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems; these messages can be sent without being blocked by data messages. We formally define consistency in runtime reconfigurations and develop a Fries scheduler with consistency guarantees. The technique works for different classes of dataflows, works for parallel executions, and supports fault tolerance. Our extensive experimental evaluation on clusters shows the advantages of this technique compared to epoch-based schedulers.
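To make the delay argument concrete, the following is a minimal sketch of a single toy operator. It is not the Fries scheduler and does not model its consistency guarantees. It only contrasts an in-band epoch marker, which waits behind queued data, with a control message delivered on a separate channel. The `Operator` class, its two queues, and the helper `tuples_processed_with_old_logic` are all hypothetical names introduced for illustration.

```python
from collections import deque


class Operator:
    """Toy operator with separate data and control queues (hypothetical design)."""

    def __init__(self, fn):
        self.fn = fn            # current computing logic
        self.data = deque()     # data tuples, plus in-band epoch markers
        self.control = deque()  # out-of-band control messages
        self.output = []

    def step(self):
        """Process one message; control messages bypass any queued data."""
        if self.control:
            self.fn = self.control.popleft()
            return True
        if self.data:
            item = self.data.popleft()
            if callable(item):
                # In-band epoch marker carrying the new logic.
                self.fn = item
            else:
                self.output.append(self.fn(item))
            return True
        return False

    def run(self):
        while self.step():
            pass


def old_logic(x):
    return ("old", x)


def new_logic(x):
    return ("new", x)


def tuples_processed_with_old_logic(use_control_channel, backlog=5):
    """Count tuples still handled by the old logic after a reconfiguration request."""
    op = Operator(old_logic)
    op.data.extend(range(backlog))       # tuples already queued at request time
    if use_control_channel:
        op.control.append(new_logic)     # control message skips the backlog
    else:
        op.data.append(new_logic)        # epoch marker waits behind the backlog
    op.data.extend(range(backlog, backlog + 2))  # tuples arriving afterwards
    op.run()
    return sum(1 for tag, _ in op.output if tag == "old")
```

In this sketch the epoch-style marker takes effect only after the whole backlog has drained, while the control message takes effect before any queued tuple is processed. With several operators, applying control messages immediately and independently could produce inconsistent results, which is the problem the abstract says the scheduler's consistency guarantees address.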