Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy

Motivated by modern applications, such as online advertising and recommender systems, we study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous, and the learner is allowed to select $k$ arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$, where $A$ is the total number of arms and $\mathcal{F}$ is the class containing the regression function, while only requiring $\tilde{O}(A)$ computation per time step. In the extreme setting, where the total number of arms can be in the millions, we propose a practically motivated arm hierarchy model that induces a certain structure in mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, thus resulting in a regret guarantee of $O(k\sqrt{(\log A-k+1)T \log (|\mathcal{F}|T)})$. Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks on simulated bandit feedback experiments using extreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, which is a 100x improvement.
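To make the Inverse Gap Weighting idea concrete, the following is a minimal sketch and not necessarily the exact procedure analyzed in the paper. It assumes the standard inverse gap weighting rule, in which every arm other than the predicted-best one gets probability $1/(n + \gamma \cdot \text{gap})$ over $n$ candidate arms, and it applies that rule slot by slot without replacement to pick $k$ arms. The function names `igw_probabilities` and `select_top_k`, the exploration parameter `gamma`, and the slot-by-slot scheme are illustrative assumptions.

```python
import random


def igw_probabilities(scores, gamma):
    """Inverse gap weighting distribution over arms (assumed standard form).

    scores: predicted rewards, one per candidate arm.
    gamma:  exploration parameter; larger values concentrate mass on the best arm.
    """
    n = len(scores)
    best = max(range(n), key=lambda i: scores[i])
    probs = [0.0] * n
    for i in range(n):
        if i != best:
            # Arms with a larger predicted gap to the best arm get less mass.
            probs[i] = 1.0 / (n + gamma * (scores[best] - scores[i]))
    # The predicted-best arm receives the remaining probability mass.
    probs[best] = 1.0 - sum(probs)
    return probs


def select_top_k(scores, k, gamma, rng=None):
    """Pick k distinct arms by sampling from IGW over the arms not yet chosen.

    This sequential scheme is an illustrative assumption, not the paper's
    stated algorithm.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of arms")
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(k):
        probs = igw_probabilities([scores[a] for a in remaining], gamma)
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen
```

Each call to `igw_probabilities` is linear in the number of candidate arms, so this sketch costs roughly $k$ passes over the arm set per round. A small `gamma` yields near-uniform exploration, while a large `gamma` approaches greedy top-$k$ selection.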
