In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned with optimizing cumulative costs and rewards. We present two applications, the intelligent SMS routing problem and the ad audience optimization problem faced by several businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms, such as Upper Confidence Bound and Thompson Sampling, do not perform well for this problem. We then establish a fundamental lower bound on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem. We also present a simple variant of the explore-then-commit algorithm and establish near-optimal regret bounds for it. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms on various problem instances and offer a practical guide for choosing among them.
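Since the abstract only names the explore-then-commit variant, the following is a minimal sketch of what such a strategy could look like in a cost-subsidy setting. It assumes Bernoulli rewards, known per-arm costs, and a subsidy factor alpha under which any arm earning at least (1 - alpha) times the best reward is acceptable; the function name, parameters, and commit rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def explore_then_commit_cs(arms, horizon, explore_rounds, alpha, rng=None):
    """Illustrative sketch (not the paper's algorithm) of explore-then-commit
    for MAB with cost subsidy. Each arm i has a known cost arms[i]['cost'] and
    an unknown Bernoulli reward mean arms[i]['mean'] (used here only to
    simulate pulls). Among arms whose estimated reward is within a (1 - alpha)
    factor of the best estimate, we commit to the cheapest one.
    Assumes horizon >= len(arms) * explore_rounds."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = len(arms)
    pulls = np.zeros(k)
    reward_sum = np.zeros(k)
    history = []

    def pull(i):
        # Simulate a Bernoulli reward draw and record the paid cost.
        r = float(rng.random() < arms[i]['mean'])
        pulls[i] += 1
        reward_sum[i] += r
        history.append((i, r, arms[i]['cost']))

    # Exploration phase: pull each arm a fixed number of times.
    for i in range(k):
        for _ in range(explore_rounds):
            pull(i)

    # Commit phase: keep arms clearing the subsidized reward threshold,
    # then commit to the cheapest of them for the rest of the horizon.
    means = reward_sum / pulls
    threshold = (1 - alpha) * means.max()
    acceptable = [i for i in range(k) if means[i] >= threshold]
    chosen = min(acceptable, key=lambda i: arms[i]['cost'])

    for _ in range(horizon - k * explore_rounds):
        pull(chosen)
    return chosen, history

# Hypothetical usage: arm 1 is slightly worse in reward but far cheaper.
arms = [{'mean': 0.9, 'cost': 1.0}, {'mean': 0.8, 'cost': 0.2}]
chosen, hist = explore_then_commit_cs(arms, horizon=10_000,
                                      explore_rounds=200, alpha=0.2)
```

A more careful variant would commit using upper and lower confidence bounds on the estimated means rather than the plain empirical averages used here, so that the committed arm clears the subsidized threshold with high probability rather than merely in expectation.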