Rapid detection and mitigation of issues that impact performance and reliability are paramount for large-scale online services. To detect such issues in real time, datacenter operators use a stream processor to analyze streams of monitoring data collected from servers (referred to as data source nodes) and their hosted services. Processing incoming streams in a timely manner requires the network to transfer massive amounts of data and demands significant compute resources to process them. These factors often create bottlenecks for stream analytics. To help overcome these bottlenecks, current monitoring systems employ near-data processing, either by computing an optimal query partition based on a cost model or by using model-agnostic heuristics. Optimal partitioning is computationally expensive, while model-agnostic heuristics are iterative and search over a large solution space. We combine these approaches by using model-agnostic heuristics to improve the partitioning solution obtained from a model-based heuristic. Moreover, current systems use operator-level partitioning: if a data source does not have sufficient resources to execute an operator on all records, the operator is executed only on the stream processor. Instead, we perform data-level partitioning, i.e., we allow an operator to be executed on both the stream processor and the data sources. We implement our algorithm in a system called Jarvis, which enables quick adaptation to dynamic resource conditions. Our evaluation on a diverse set of monitoring workloads suggests that Jarvis converges to a stable query partition within seconds of a change in node resource conditions. Compared to current partitioning strategies, Jarvis handles up to 75% more data sources while improving throughput in resource-constrained scenarios by 1.2-4.4x.
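The difference between operator-level and data-level partitioning can be illustrated with a minimal sketch. This is not the Jarvis algorithm: the linear cost model, the `Operator` fields, the function names, and all numbers below are assumptions made purely for illustration. Each function returns, for every operator in a linear query pipeline, the fraction of that operator's input that the data source processes locally; the remainder is forwarded to the stream processor.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    cost_per_record: float  # hypothetical CPU cost units per input record
    selectivity: float      # output records per input record (a filter is < 1)


def operator_level(ops, records, budget):
    """Run an operator on the source only if the budget covers ALL its input."""
    fractions = []
    for op in ops:
        need = records * op.cost_per_record
        if need > budget:
            # This operator and everything downstream runs on the stream processor.
            fractions.extend([0.0] * (len(ops) - len(fractions)))
            break
        fractions.append(1.0)
        budget -= need
        records *= op.selectivity
    return fractions


def data_level(ops, records, budget):
    """Process as many records per operator as the budget allows; forward the rest."""
    fractions = []
    for op in ops:
        need = records * op.cost_per_record
        frac = 1.0 if need <= budget else budget / need
        fractions.append(frac)
        processed = records * frac
        budget -= processed * op.cost_per_record
        records = processed * op.selectivity
    return fractions


def records_sent(ops, records, fractions):
    """Records shipped from the source to the stream processor."""
    sent = 0.0
    for op, frac in zip(ops, fractions):
        sent += records * (1.0 - frac)            # unprocessed input is forwarded
        records = records * frac * op.selectivity  # locally produced output
    return sent + records                          # final local output is shipped too


# Hypothetical two-operator monitoring query and resource budget.
query = [Operator("filter", 1.0, 0.5), Operator("aggregate", 2.0, 0.1)]
op_plan = operator_level(query, records=1000, budget=1500)
data_plan = data_level(query, records=1000, budget=1500)
print("operator-level:", op_plan, records_sent(query, 1000, op_plan))
print("data-level:    ", data_plan, records_sent(query, 1000, data_plan))
```

Under these assumed numbers, operator-level partitioning cannot run the aggregate on the source at all, so the source forwards all 500 filtered records. Data-level partitioning runs the aggregate on half of them, and the source forwards roughly 275 records instead.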