Incidents in production systems are common, and downtime is expensive. Quickly applying an appropriate mitigating action, such as changing a specific firewall rule, reverting a change, or diverting traffic to a different availability zone, saves money. Incident localization is time-consuming because a single failure can have many effects that extend far from the site of the failure. Knowing how different system events relate to each other is necessary to quickly identify \emph{where} to mitigate. Our approach, Aggregate Comparison of Traces (ACT), localizes incidents by comparing two sets of traces, which capture events and their relationships for individual requests: one sampled from the most recent steady-state operation and one sampled during the incident. In our quantitative experiments, ACT effectively localizes more than 99% of incidents.