Software performance changes are costly and often hard to detect pre-release. Similar to software testing frameworks, application benchmarks or microbenchmarks can be integrated into quality assurance pipelines to detect performance changes before a new application version is released. Unfortunately, extensive benchmarking studies usually take several hours, which is problematic when examining dozens of daily code changes in detail; hence, trade-offs have to be made. Optimized microbenchmark suites, which include only a small subset of the full suite, are a potential solution to this problem, provided that they still reliably detect the majority of application performance changes, such as an increased request latency. It is, however, unclear whether microbenchmarks and application benchmarks detect the same performance problems and whether one can serve as a proxy for the other. In this paper, we explore whether microbenchmark suites can detect the same application performance changes as an application benchmark. To this end, we run extensive benchmark experiments with both the complete and the optimized microbenchmark suites of the two time-series database systems InfluxDB and VictoriaMetrics and compare their results to the results of corresponding application benchmarks. We do this for 70 and 110 commits, respectively. Our results show that it is possible to detect application performance changes using an optimized microbenchmark suite if frequent false-positive alarms can be tolerated.
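To make the detection task concrete, the sketch below illustrates one simple way a performance change between two commits could be flagged from repeated benchmark measurements; it is a minimal illustrative example, not the detection method used in this paper, and the function name, sample values, and the 5% threshold are assumptions made for this example.

```python
# Minimal illustrative sketch (not the paper's actual detection method):
# flag a performance change between a baseline commit and a candidate commit
# by comparing mean benchmark results against a relative-change threshold.
from statistics import mean

def detect_change(baseline_runs, candidate_runs, threshold=0.05):
    """Return True if the candidate commit's mean result deviates from the
    baseline commit's mean by more than `threshold` (relative change)."""
    base = mean(baseline_runs)
    cand = mean(candidate_runs)
    return abs(cand - base) / base > threshold

# Example: request latencies (ms) measured for two commits (hypothetical values).
baseline = [12.1, 11.9, 12.3, 12.0]
candidate = [13.4, 13.1, 13.6, 13.2]
print(detect_change(baseline, candidate))  # True: ~10% slowdown exceeds the 5% threshold
```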