Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference ``default'' decision, with as low a cost as possible. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and the rewards are unknown. In our work, we address variants of MAB-CS including ones with reward constrained by the reward of a known reference arm or by the subsidized best reward. We introduce the Pairwise-Elimination (PE) algorithm for the known reference arm variant and generalize PE to PE-CS for the subsidized best reward variant. Our instance-dependent analysis of PE and PE-CS reveals that both algorithms have an order-wise logarithmic upper bound on Cost and Quality Regret, making our policies the first with such a guarantee. Moreover, by comparing our upper and lower bound results we establish that PE is order-optimal for all known reference arm problem instances. Finally, experiments are conducted using the MovieLens 25M and Goodreads datasets for both PE and PE-CS revealing the effectiveness of PE and the superior balance between performance and reliability offered by PE-CS compared to baselines from the literature.
翻译:多臂赌博机(MAB)通常用于序列化在线决策场景,其中每次决策的奖励是未知随机变量。然而在实践中,相较于在奖励约束下最小化决策总成本,最大化总奖励的典型目标可能居于次要地位。例如,我们可能希望做出的决策至少能达到参考"默认"决策的奖励水平,同时尽可能降低决策成本。该问题近期在带成本补贴的多臂赌博机(MAB-CS)框架中被提出。MAB-CS广泛适用于主要指标(成本)受次要指标(奖励)约束且奖励未知的问题领域。本文研究MAB-CS的若干变体,包括奖励受已知参考臂奖励约束或受补贴最优奖励约束的变体。针对已知参考臂变体,我们提出成对消除(PE)算法,并将其推广至适用于补贴最优奖励变体的PE-CS算法。对PE和PE-CS的实例相关分析表明,两种算法在成本与质量遗憾方面均具有阶数意义上的对数上界,这使得我们的策略成为首个具备该保证的算法。进一步通过上下界结果对比,我们证明PE对所有已知参考臂问题实例均具有阶数最优性。最后,基于MovieLens 25M和Goodreads数据集对PE和PE-CS进行实验验证,结果表明PE具有显著有效性,且相较于现有基线方法,PE-CS在性能与可靠性之间实现了更优的平衡。