Detecting performance issues and identifying their root causes at runtime is a challenging task. Typically, developers rely on methods such as logging and tracing to identify bottlenecks, but these solutions are not ideal because they are time-consuming and require manual effort. In this paper, we propose a method that automates the detection of latency outliers from system-level traces and then compares them to identify the root cause(s). Our method uses dependency graphs to expose the internal interactions between threads and system resources; with these graphs, one can pinpoint where performance issues occur. However, a single trace can contain a large number of requests, each generating its own graph. To automate the identification of outliers within this dataset, we use density-based machine learning models and statistical measures such as the Z-score. Our evaluation shows an accuracy greater than 97% on outlier detection, making the approach suitable for in-production servers and industry-level use cases.
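As a minimal illustration of the two detection strategies mentioned above, the sketch below flags latency outliers among per-request durations using a Z-score threshold and a density-based model. It assumes scikit-learn and uses Local Outlier Factor as a stand-in for the density-based model; the function names, the threshold of 3 standard deviations, and the synthetic latencies are hypothetical and not taken from the paper, and feature extraction from the dependency graphs is assumed to have happened upstream.

```python
# Hypothetical sketch: flag latency outliers among per-request durations
# extracted from a trace, using (a) a Z-score threshold and (b) a
# density-based model (Local Outlier Factor as one possible choice).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def zscore_outliers(latencies, threshold=3.0):
    """Mark requests whose latency deviates more than `threshold`
    standard deviations from the mean latency."""
    latencies = np.asarray(latencies, dtype=float)
    z = (latencies - latencies.mean()) / latencies.std()
    return np.abs(z) > threshold

def density_outliers(features, n_neighbors=20):
    """Mark requests that lie in low-density regions of the feature
    space (e.g., latency plus graph-derived metrics per request)."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    labels = lof.fit_predict(features)  # -1 = outlier, 1 = inlier
    return labels == -1

# Usage on synthetic data: 1000 request latencies (ms), a few slow ones.
rng = np.random.default_rng(0)
latencies = np.concatenate([rng.normal(10, 1, 995), rng.normal(60, 5, 5)])
print(zscore_outliers(latencies).sum(), "Z-score outliers")
print(density_outliers(latencies.reshape(-1, 1)).sum(), "density-based outliers")
```

In practice the feature vectors would come from the per-request dependency graphs rather than raw latencies alone; the density-based model then isolates requests whose behavior differs from the bulk of the trace.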