Some mission-critical systems, such as fraud detection, require accurate, real-time metrics over long sliding windows in applications that demand high throughput and low latency. Because these applications must run 'forever' and cope with large, spiky data loads, they must also run in a distributed setting. We are unaware of any streaming system that provides all of these properties. Instead, existing systems make significant simplifications, such as implementing sliding windows as a fixed set of overlapping windows, which jeopardizes metric accuracy (violating regulatory rules) or latency (breaching service agreements). In this paper, we propose Railgun, a fault-tolerant, elastic, and distributed streaming system that supports real-time sliding windows in scenarios requiring high loads and millisecond-level latencies. We benchmarked an initial prototype of Railgun using real data, showing significantly lower latency than Flink and low memory usage independent of window size. Further, we show that Railgun scales nearly linearly, meeting our millisecond-level latency targets at high percentiles (<250 ms at the 99.9th percentile) even under a load of 1 million events per second.
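To illustrate why approximating a sliding window by a fixed set of overlapping windows can hurt metric accuracy, the following minimal sketch compares an exact per-event sliding count with a hopping-window approximation. It is not Railgun's implementation: the function names `sliding_counts` and `hopping_counts`, the integer timestamps, and the choice to report the oldest hopping window that still contains the current event are all assumptions made for illustration.

```python
from collections import deque


def sliding_counts(timestamps, window):
    """Exact count of events in (t - window, t], evaluated at every event t."""
    live = deque()  # timestamps currently inside the window, oldest first
    out = []
    for t in timestamps:
        # Evict events that have fallen out of the window ending at t.
        while live and live[0] <= t - window:
            live.popleft()
        live.append(t)
        out.append(len(live))
    return out


def hopping_counts(timestamps, window, hop):
    """Approximation using windows [k*hop, k*hop + window) for integer k.

    At each event t we report the oldest hopping window that contains t
    (an illustrative assumption); events between t - window and that
    window's start are silently missed.
    """
    seen = []
    out = []
    for t in timestamps:
        seen.append(t)
        # Smallest multiple of hop that is strictly greater than t - window.
        start = ((t - window) // hop + 1) * hop
        out.append(sum(1 for s in seen if s >= start))
    return out


events = [1, 4, 6, 9, 12, 13]  # hypothetical event times, sorted
print(sliding_counts(events, window=10))         # [1, 2, 3, 4, 4, 5]
print(hopping_counts(events, window=10, hop=5))  # [1, 2, 3, 4, 3, 4]
```

In this toy input, the hopping approximation undercounts at the last two events because the event at time 4 lies inside the true 10-unit lookback but before the start of the reported hopping window. Shrinking the hop reduces the error at the cost of maintaining more concurrent windows.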