This paper presents HURRA, a system that aims to reduce the time spent by human operators in the process of network troubleshooting. To do so, it comprises two modules that are plugged after any anomaly detection algorithm: (i) a first attention mechanism, that ranks the present features in terms of their relation with the anomaly and (ii) a second module able to incorporates previous expert knowledge seamlessly, without any need of human interaction nor decisions. We show the efficacy of these simple processes on a collection of real router datasets obtained from tens of ISPs which exhibit a rich variety of anomalies and very heterogeneous set of KPIs, on which we gather manually annotated ground truth by the operator solving the troubleshooting ticket. Our experimental evaluation shows that (i) the proposed system is effective in achieving high levels of agreement with the expert, that (ii) even a simple statistical approach is able to extracting useful information from expert knowledge gained in past cases to further improve performance and finally that (iii) the main difficulty in live deployment concerns the automated selection of the anomaly detection algorithm and the tuning of its hyper-parameters.
翻译:本文介绍了旨在缩短人类操作者在网络排除故障过程中所花时间的HURRA系统,为此,它包括两个模块,在异常检测算法之后被插入:(一) 第一个关注机制,根据与异常点的关系,将当前特征排序,以及(二) 第二个模块,能够无缝地纳入先前的专家知识,而不需要人的互动或决定。我们展示了这些简单程序在收集从数十个ISP获得的真正路由器数据集方面的功效,这些数据显示有各种各样的异常点和非常不同的KPI系统,我们通过操作者手动收集了附加注释的地面真相。我们的实验评估表明,(一) 拟议的系统对于与专家达成高水平的协议是有效的,(二) 甚至一个简单的统计方法能够从以往案例获得的专家知识中获取有用的信息,以进一步改进绩效,最后,(三) 现场部署的主要困难是自动选择异常检测算法和调整其超参数。