Effectively measuring, understanding, and improving mobile app performance is of paramount importance for mobile app developers. Across the mobile Internet landscape, companies run online controlled experiments (A/B tests) with thousands of performance metrics in order to understand how app performance causally impacts user retention and to guard against service or app regressions that degrade user experience. To capture characteristics particular to performance metrics, such as enormous observation volume and highly skewed distributions, an industry-standard practice is to construct a performance metric as a quantile over all performance events in the control or treatment bucket of an A/B test. In our experience with thousands of A/B tests provided by Snap, we have discovered pitfalls in this industry-standard way of calculating performance metrics that can lead to unexplained metric movements and unexpected misalignment with user engagement metrics. In this paper, we discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps. One arises from strong heterogeneity in both mobile devices and user engagement, and the other arises from self-selection bias caused by post-treatment changes in user engagement. To remedy these two pitfalls, we introduce several scalable methods, including user-level performance metric calculation as well as imputation and matching for missing metric values. We have extensively evaluated these methods on both simulated data and real A/B tests, and have deployed them into Snap's in-house experimentation platform.
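To make the contrast concrete, the following is a minimal sketch (not the paper's implementation) of the two metric constructions mentioned above: the industry-standard quantile pooled over all performance events in a bucket, versus a user-level calculation in which each user's quantile is computed first and then averaged so that heavy users do not dominate. The function names, quantile level, and toy latency data are illustrative assumptions.

```python
import numpy as np

def event_level_quantile(bucket, q=0.9):
    """Industry-standard construction (as described in the abstract):
    pool all performance events in the bucket and take one quantile.
    Heavier users contribute more events and thus more weight."""
    return float(np.quantile(np.concatenate(list(bucket.values())), q))

def user_level_quantile(bucket, q=0.9):
    """User-level construction (one of the remedies sketched here,
    hypothetically): compute the quantile per user, then average across
    users, so every user contributes equally regardless of engagement."""
    per_user = [np.quantile(events, q) for events in bucket.values() if len(events) > 0]
    return float(np.mean(per_user))

# Hypothetical toy data: user_id -> observed latencies (ms) in one bucket.
bucket = {
    "u1": np.array([120.0, 135.0, 110.0]),          # light user, fast device
    "u2": np.array([300.0] * 50 + [280.0] * 50),    # heavy user, slow device
}
print(event_level_quantile(bucket))  # dominated by the heavy user's events
print(user_level_quantile(bucket))   # each user weighted equally
```

Under heterogeneous devices and engagement, the two constructions can move in different directions for the same treatment, which is the kind of unexplained movement the abstract refers to.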