Selecting the optimal recommender via online exploration-exploitation has attracted increasing attention, as traditional A/B testing can be slow and costly, and offline evaluations are prone to the bias of historical data. Finding the optimal online experiment is nontrivial since both the users and the displayed recommendations carry contextual features that are informative of the reward. While the problem can be formalized through the lens of multi-armed bandits, existing solutions are less than satisfactory because the general methodologies do not account for case-specific structures, particularly for the e-commerce recommendation setting we study. To fill this gap, we leverage the \emph{D-optimal design} from the classical statistics literature to achieve maximum information gain during exploration, and we show how it fits seamlessly with the modern infrastructure of online inference. To demonstrate the effectiveness of the optimal designs, we provide semi-synthetic simulation studies with published code and data for reproducibility. We then use our deployment on Walmart.com to illustrate the practical insights and effectiveness of the proposed methods.
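To make the exploration criterion concrete, the following is a minimal, hypothetical sketch (not the deployed system) of D-optimal arm selection for a linear-reward contextual bandit. With information matrix $A = \lambda I + \sum_t x_t x_t^\top$, the matrix determinant lemma gives $\det(A + x x^\top) = \det(A)\,(1 + x^\top A^{-1} x)$, so maximizing the post-update log-determinant reduces to picking the candidate with the largest leverage score $x^\top A^{-1} x$. The function name and the simulation loop are illustrative assumptions.
\begin{verbatim}
import numpy as np

def d_optimal_choice(A, candidates):
    """Pick the candidate feature vector giving maximum information gain.

    By the matrix determinant lemma,
        det(A + x x^T) = det(A) * (1 + x^T A^{-1} x),
    so maximizing the post-update log-determinant of the information
    matrix A is equivalent to maximizing the leverage score x^T A^{-1} x.
    """
    A_inv = np.linalg.inv(A)
    scores = [x @ A_inv @ x for x in candidates]
    return int(np.argmax(scores))

# Toy usage: d-dimensional contexts, ridge-regularized information matrix.
rng = np.random.default_rng(0)
d = 5
A = np.eye(d)                            # A = lambda*I, with lambda = 1
for _ in range(100):
    candidates = rng.normal(size=(4, d)) # features of the candidate arms
    i = d_optimal_choice(A, candidates)
    x = candidates[i]
    A += np.outer(x, x)                  # accumulate observed information
\end{verbatim}
In practice the leverage-score form is preferred over recomputing determinants, since $A^{-1}$ can be maintained incrementally (e.g., via the Sherman-Morrison update) as pulls are observed.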