Performance debugging in production is a fundamental activity in modern service-based systems. Diagnosing performance issues is often time-consuming because it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests whose combinations of Remote Procedure Call (RPC) execution times show symptoms of potentially relevant performance issues. We call such symptoms Latency Degradation Patterns. DeLag searches for multiple latency degradation patterns simultaneously while optimizing precision, recall, and latency dissimilarity. Experiments on 700 datasets of requests generated from two microservice-based systems show that our approach is more effective, and more stable in its effectiveness, than three state-of-the-art approaches and general-purpose machine learning clustering algorithms. DeLag is more effective than all baseline techniques in at least one case study (with $p \leq 0.05$ and non-negligible effect size). Moreover, on the largest datasets used in our evaluation, DeLag is more efficient than the second and third most effective baseline techniques (by up to 22%).
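As a rough illustration of how a candidate pattern could be scored, the following minimal Python sketch treats a pattern as a conjunction of interval conditions on RPC execution times and computes its precision and recall against requests flagged as degraded. The `Condition` class, the field names (`auth`, `db`, `latency`), the 100 ms threshold, and the sample data are all hypothetical and chosen only for illustration; they are not taken from DeLag, and the latency dissimilarity objective and the search procedure itself are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical interval condition on one RPC's execution time (ms)."""
    rpc: str
    low: float
    high: float

    def holds(self, request):
        # True when the RPC's execution time lies in [low, high).
        return self.low <= request[self.rpc] < self.high


def matches(pattern, request):
    # A request matches a pattern when every condition holds.
    return all(c.holds(request) for c in pattern)


def precision_recall(pattern, requests, is_degraded):
    # Score a pattern against the requests flagged as degraded.
    selected = [r for r in requests if matches(pattern, r)]
    true_pos = sum(1 for r in selected if is_degraded(r))
    positives = sum(1 for r in requests if is_degraded(r))
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / positives if positives else 0.0
    return precision, recall


# Made-up requests: per-RPC execution times plus end-to-end latency (ms).
requests = [
    {"auth": 10, "db": 20, "latency": 40},
    {"auth": 12, "db": 180, "latency": 210},
    {"auth": 11, "db": 190, "latency": 220},
    {"auth": 95, "db": 25, "latency": 130},
]

# Assumed definition of "degraded": end-to-end latency above 100 ms.
def is_degraded(request):
    return request["latency"] > 100

pattern = [Condition("db", 150, 250)]
precision, recall = precision_recall(pattern, requests, is_degraded)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the single pattern on `db` selects only degraded requests but misses the one slowed by `auth`, which suggests why searching for several patterns at once can be useful.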