Web-scale applications can ship code on a daily to weekly cadence and rely on online metrics to monitor the health of new releases. Regressions in metric values need to be detected and diagnosed as early as possible to reduce disruption to users and product owners. Such regressions can have a variety of causes; genuine product regressions, changes in user population, and bias due to telemetry loss or processing are among the most common. Diagnosing them is costly for engineering teams, who must invest time in finding the root cause as quickly as possible. We present Lumos, a Python library built on the principles of A/B testing that automates this analysis by systematically diagnosing metric regressions. Lumos has been deployed across the component teams in Microsoft's Real-Time Communication (RTC) applications, Skype and Microsoft Teams. It has enabled engineering teams to detect hundreds of real changes in metrics and to reject thousands of false alarms raised by anomaly detectors. Applying Lumos has freed up as much as 95% of the time allocated to metric-based investigations. In this work, we open-source Lumos and present results from applying it to two different components within the RTC group over millions of sessions. This general-purpose library can be coupled with any production system to manage the volume of alerting efficiently.