Multi-armed bandit (MAB) algorithms are efficient approaches to reducing the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start problem at the onset of an experiment: lacking knowledge of customer preferences for new products, they require an initial data collection phase known as the burning period. During this period, MAB algorithms operate like randomized experiments, incurring large burning costs that scale with the number of products. We attempt to reduce the burning cost by observing that many products can be cast as two-sided products, so that their rewards are naturally modeled by a matrix whose rows and columns represent the two sides, respectively. We then design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products, and then apply a UCB procedure on the targeted set to find the best product. We theoretically show that the proposed algorithms lower costs and expedite the experiment when experimentation time is limited and the product set is large. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on the dimensions of the reward matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.
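For illustration, the two-phase idea could be sketched as follows. This is a minimal sketch under assumed details, not the authors' implementation: all names (pull, rank, n_samples, target_size) are hypothetical, Phase 1 here denoises the subsampled reward matrix with a truncated SVD as a stand-in for the paper's low-rank estimator, and Phase 2 runs standard UCB1 on the targeted set.

# Minimal sketch of a two-phase bandit over a two-sided product matrix.
# Phase 1: subsample entries, estimate the matrix at rank r, keep top entries.
# Phase 2: run UCB1 on the targeted entries. Names are hypothetical.
import numpy as np

def phase1_target_set(pull, n_rows, n_cols, rank, n_samples, target_size, rng):
    """Subsample matrix entries and return a small targeted set of products."""
    counts = np.zeros((n_rows, n_cols))
    sums = np.zeros((n_rows, n_cols))
    for _ in range(n_samples):
        i, j = rng.integers(n_rows), rng.integers(n_cols)
        sums[i, j] += pull(i, j)          # noisy reward for product (i, j)
        counts[i, j] += 1
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # Low-rank denoising: keep only the top `rank` singular directions.
    U, s, Vt = np.linalg.svd(means, full_matrices=False)
    est = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Targeted set: indices of the largest estimated rewards.
    flat = np.argsort(est, axis=None)[-target_size:]
    return [np.unravel_index(k, est.shape) for k in flat]

def phase2_ucb(pull, arms, horizon):
    """Standard UCB1 over the targeted arms; returns the most-pulled arm."""
    n = np.ones(len(arms))
    mu = np.array([pull(*a) for a in arms], dtype=float)  # one pull per arm
    for t in range(len(arms), horizon):
        k = int(np.argmax(mu + np.sqrt(2 * np.log(t + 1) / n)))
        mu[k] += (pull(*arms[k]) - mu[k]) / (n[k] + 1)    # incremental mean
        n[k] += 1
    return arms[int(np.argmax(n))]

# Usage on a synthetic rank-1 reward matrix with Gaussian noise:
rng = np.random.default_rng(0)
true = np.outer(rng.random(20), rng.random(30))          # 20 x 30 mean rewards
pull = lambda i, j: true[i, j] + rng.normal(0, 0.1)      # noisy bandit feedback
targets = phase1_target_set(pull, 20, 30, rank=1, n_samples=3000,
                            target_size=10, rng=rng)
best = phase2_ucb(pull, targets, horizon=2000)

The split of the budget between the two phases is the key design choice; the three horizon regimes in the analysis correspond to how much of the total experimentation time such a Phase 1 can afford relative to the matrix dimensions.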