Multi-armed bandit (MAB) algorithms are efficient approaches to reducing the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start problem at the onset of an experiment: lacking knowledge of customer preferences for new products, they require an initial data-collection phase known as the burn-in period. During this period, MAB algorithms operate like randomized experiments, incurring burn-in costs that scale with the large number of products. We reduce the burn-in by observing that many products can be cast as two-sided products, so their rewards are naturally modeled by a matrix whose rows and columns represent the two sides. We then design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products, and then apply a UCB procedure on the targeted set to find the best one. We show theoretically that the proposed algorithms lower costs and expedite the experiment when experimentation time is limited and the product set is large. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on the dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.
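To make the two-phase idea concrete, the following is a minimal sketch in Python of one plausible instantiation: Phase 1 uniformly subsamples entries of the (row x column) reward matrix during a burn-in budget and uses a rank-truncated SVD as a simple low-rank estimator to shortlist a small targeted set of products; Phase 2 runs a standard UCB1 procedure on that shortlist only. All names, parameters (e.g. `burn_in_frac`, `top_k`, the noise scale), and the SVD-based estimator are illustrative assumptions, not the paper's exact algorithm or guarantees.

```python
import numpy as np

def two_phase_bandit(reward_matrix, horizon, rank=2, burn_in_frac=0.3, top_k=5, seed=None):
    """Sketch of a two-phase bandit on a two-sided product reward matrix.

    Phase 1: subsample arms (matrix entries), fill a partial matrix, and use a
    rank-truncated SVD estimate to shortlist `top_k` candidate products.
    Phase 2: run UCB1 on the shortlisted arms for the remaining horizon.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = reward_matrix.shape

    # ----- Phase 1: subsampling + low-rank matrix estimation -----
    burn_in = int(burn_in_frac * horizon)
    counts = np.zeros((n_rows, n_cols))
    sums = np.zeros((n_rows, n_cols))
    for _ in range(burn_in):
        i, j = rng.integers(n_rows), rng.integers(n_cols)
        sums[i, j] += reward_matrix[i, j] + rng.normal(scale=0.1)  # noisy reward
        counts[i, j] += 1
    partial = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)

    # Rank-truncated SVD as a stand-in for a low-rank matrix estimator.
    U, s, Vt = np.linalg.svd(partial, full_matrices=False)
    estimate = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    # Targeted set: entries with the highest estimated reward.
    flat_order = np.argsort(estimate, axis=None)[::-1][:top_k]
    targets = [np.unravel_index(idx, estimate.shape) for idx in flat_order]

    # ----- Phase 2: UCB1 on the targeted set -----
    k = len(targets)
    pulls = np.zeros(k)
    means = np.zeros(k)
    for t in range(1, horizon - burn_in + 1):
        if t <= k:
            a = t - 1  # pull each targeted arm once
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t) / pulls)))
        i, j = targets[a]
        r = reward_matrix[i, j] + rng.normal(scale=0.1)
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]  # incremental mean update
    return targets[int(np.argmax(means))]
```

A usage example under these assumptions: `two_phase_bandit(true_means, horizon=5000, top_k=10)` with `true_means` a mean-reward matrix would return the (row, column) index of the product the procedure identifies as best; the point of the design is that UCB exploration is confined to `top_k` arms rather than the full catalog.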