Failures and anomalies are unavoidable in large-scale software systems. When an issue is detected, operators need to identify its location quickly and correctly so that it can be repaired swiftly. In this work, we consider the problem of identifying the root cause set that best explains an anomaly in multi-dimensional time series with categorical attributes. The main challenge is the huge search space: even for a small number of attributes and small value sets, the number of theoretical combinations is too large to brute force. Previous approaches have thus focused on reducing the search space, but they all suffer from one or more issues: they require extensive manual parameter tuning, are too slow to be practical, or cannot find more complex root causes. We propose RiskLoc to solve the problem of multi-dimensional root cause localization. RiskLoc applies a 2-way partitioning scheme and assigns element weights that increase linearly with the distance from the partitioning point. Each element is assigned a risk score that integrates two factors: 1) its weighted proportion within the abnormal partition, and 2) the relative change in the deviation score adjusted for the ripple effect property. Extensive experiments on multiple datasets verify the effectiveness and efficiency of RiskLoc; for a comprehensive evaluation, we introduce three synthetically generated datasets that complement existing datasets. We demonstrate that RiskLoc consistently outperforms state-of-the-art baselines, especially in more challenging root cause scenarios, with gains in F1-score of up to 57% over the second-best approach at comparable running times.
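To make the search-space claim and the weighting idea concrete, the following is a minimal Python sketch and not the RiskLoc implementation. It assumes that an element fixes each attribute to either a concrete value or a wildcard, and that a candidate root cause set is any non-empty set of elements. The function names, the `cut` partitioning point, the `1 + distance` weight form, and the choice of the lower side as the abnormal partition are all hypothetical and chosen only for illustration.

```python
from math import prod


def count_elements(cardinalities):
    """Number of attribute combinations when each attribute is either a
    concrete value or a wildcard (the all-wildcard combination is excluded)."""
    return prod(n + 1 for n in cardinalities) - 1


def count_candidate_sets(cardinalities):
    """Number of non-empty sets of elements, i.e. candidate root cause sets."""
    return 2 ** count_elements(cardinalities) - 1


def linear_weights(deviations, cut):
    """Two-way partition of leaf deviation scores at `cut`.

    Assumption: scores below `cut` form the abnormal partition, and the
    weight grows linearly with the distance from `cut` (here 1 + distance).
    """
    abnormal = [d < cut for d in deviations]
    weights = [1.0 + abs(d - cut) for d in deviations]
    return abnormal, weights


def weighted_abnormal_share(mask, abnormal, weights):
    """Weighted proportion of an element's leaves that are abnormal.

    `mask[i]` is True when leaf i belongs to the element being scored.
    """
    total = sum(w for m, w in zip(mask, weights) if m)
    hit = sum(w for m, a, w in zip(mask, abnormal, weights) if m and a)
    return hit / total if total else 0.0
```

Under these assumptions, three attributes with three values each already give 63 elements and 2^63 - 1 candidate sets, which illustrates why brute force is impractical even for small inputs. The weighted share covers only the first of the two risk-score factors; the ripple-effect adjustment is not sketched here.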