We study the problem of contextual combinatorial semi-bandits, where input contexts are mapped to size-$m$ subsets of a collection of $K$ possible actions. In each round, the learner observes the realized rewards of the predicted actions. Motivated by prototypical applications of contextual bandits, we focus on the $s$-sparse regime, in which the sum of rewards is assumed to be bounded by some value $s \ll K$. For example, in recommendation systems the number of products purchased by any customer is significantly smaller than the total number of available products. Our main result concerns the $(\epsilon,\delta)$-PAC variant of the problem, for which we design an algorithm that, with high probability, returns an $\epsilon$-optimal policy with sample complexity $\tilde{O}\big((\mathrm{poly}(K/m) + sm/\epsilon^2) \log(|\Pi|/\delta)\big)$, where $\Pi$ is the underlying (finite) policy class and $s$ is the sparsity parameter. This bound improves upon known bounds for combinatorial semi-bandits whenever $s \ll K$, and in the regime $s = O(1)$ the leading terms in our bound match the corresponding full-information rates, implying that bandit feedback essentially comes at no cost. Our algorithm is also computationally efficient given access to an ERM oracle for $\Pi$. Our framework generalizes the list multiclass classification problem with bandit feedback, which can be seen as a special case with binary reward vectors. In the special case of single-label classification, corresponding to $s = m = 1$, we prove an $O\big((K^7 + 1/\epsilon^2) \log(|H|/\delta)\big)$ sample complexity bound, improving upon recent results in this setting. Additionally, we consider the regret minimization setting, where data may be generated adversarially, and establish a regret bound of $\tilde{O}(|\Pi| + \sqrt{smT \log |\Pi|})$, extending the result of Erez et al. (2024), who consider the simpler setting of single-label classification.
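For concreteness, the interaction can be formalized as follows (a minimal sketch consistent with the abstract; the symbols $x_t$, $r_t$, and $\hat{\pi}$ are our notation rather than the paper's, and we take rewards in $[0,1]$ for illustration). In each round $t = 1, \dots, T$, the learner observes a context $x_t$; a policy $\pi \in \Pi$ maps it to a slate $\pi(x_t) \subseteq [K]$ with $|\pi(x_t)| = m$; and a reward vector $r_t \in [0,1]^K$ is realized subject to the sparsity constraint $\sum_{a=1}^{K} r_t(a) \le s \ll K$. Under semi-bandit feedback, the learner observes only $\{ r_t(a) : a \in \pi(x_t) \}$ and collects reward $\sum_{a \in \pi(x_t)} r_t(a)$. In the $(\epsilon,\delta)$-PAC variant, the goal is to output a policy $\hat{\pi}$ satisfying, with probability at least $1 - \delta$,
\[
\max_{\pi \in \Pi} \, \mathbb{E}\Big[\sum_{a \in \pi(x)} r(a)\Big] \;-\; \mathbb{E}\Big[\sum_{a \in \hat{\pi}(x)} r(a)\Big] \;\le\; \epsilon,
\]
where the expectation is over a fresh context-reward pair $(x, r)$ drawn from the underlying distribution (in the regret minimization setting, by contrast, the sequence of pairs may be chosen adversarially).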