Performance debugging in production is a fundamental activity in modern service-based systems. Diagnosing performance issues is often time-consuming, since it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests whose combined Remote Procedure Call (RPC) execution times show symptoms of potentially relevant performance issues. We call such symptoms Latency Degradation Patterns. DeLag searches for multiple latency degradation patterns simultaneously while optimizing precision, recall, and latency dissimilarity. Experiments on 700 datasets of requests generated from two microservice-based systems show that our approach is more effective, and more stable in its effectiveness, than three state-of-the-art approaches and general-purpose machine learning clustering algorithms. DeLag is more effective than all baseline techniques in at least one case study (with p ≤ 0.05 and non-negligible effect size). Moreover, on the largest datasets used in our evaluation, DeLag is more efficient than the second and third most effective baseline techniques (by up to 22%).
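To make the notions of a latency degradation pattern, precision, and recall more concrete, the following is a minimal Python sketch under our own assumptions, not DeLag's implementation. It assumes that a request is represented as a hypothetical `Request` holding per-RPC execution times and an end-to-end latency, that a pattern is a conjunction of per-RPC execution-time ranges, and that a request counts as degraded when its latency exceeds a hypothetical threshold `slo`. Under these assumptions, precision is the fraction of requests matched by a pattern that are degraded, and recall is the fraction of degraded requests the pattern matches. The search procedure and the latency dissimilarity objective are not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    # Hypothetical representation: RPC name -> execution time (ms),
    # plus the request's end-to-end latency (ms).
    rpc_times: Dict[str, float]
    latency: float


# Hypothetical pattern: a conjunction of per-RPC execution-time ranges [low, high).
Pattern = Dict[str, Tuple[float, float]]


def matches(pattern: Pattern, req: Request) -> bool:
    """True if every RPC constrained by the pattern falls inside its range."""
    return all(
        rpc in req.rpc_times and low <= req.rpc_times[rpc] < high
        for rpc, (low, high) in pattern.items()
    )


def precision_recall(
    pattern: Pattern, requests: List[Request], slo: float
) -> Tuple[float, float]:
    """Precision and recall of a pattern w.r.t. requests slower than `slo` (assumed)."""
    matched = [r for r in requests if matches(pattern, r)]
    degraded = [r for r in requests if r.latency > slo]
    true_pos = sum(1 for r in matched if r.latency > slo)
    precision = true_pos / len(matched) if matched else 0.0
    recall = true_pos / len(degraded) if degraded else 0.0
    return precision, recall


# Usage with made-up illustrative requests (not data from the evaluation).
requests = [
    Request({"auth": 5, "db": 120}, 300),
    Request({"auth": 6, "db": 110}, 280),
    Request({"auth": 5, "db": 20}, 90),
    Request({"auth": 50, "db": 25}, 260),
]
pattern = {"db": (100, 200)}
print(precision_recall(pattern, requests, slo=250))  # (1.0, 0.6666666666666666)
```

Here the pattern on the `db` RPC matches only degraded requests (precision 1.0) but misses the slow request caused by `auth`, so its recall is 2/3. This is the kind of situation in which searching for several patterns at once, as the abstract describes, can cover degradations with different causes.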