In both academic and industry research, online evaluation methods are seen as the gold standard for interactive applications like recommender systems. Naturally, this is because they directly measure utility metrics that rely on interventions: the recommendations that are shown to users. Nevertheless, online evaluation methods are costly for a number of reasons, and a clear need remains for reliable offline evaluation procedures. In industry, offline metrics are often used as a first-line evaluation to generate promising candidate models for online evaluation. In academic work, limited access to online systems makes offline metrics the de facto approach to validating novel methods. Two classes of offline metrics exist: proxy-based methods and counterfactual methods. The former are often poorly correlated with the online metrics we care about, and the latter provide theoretical guarantees only under assumptions that cannot be fulfilled in real-world environments. Here, we make the case that simulation-based comparisons provide ways forward beyond offline metrics, and argue that they are a preferable means of evaluation.