We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function belonging to a class $\mathcal{F}$. We allow time-varying matroid constraints to be placed on the feasible sets. Assuming access to an online regression oracle with regret $\mathsf{Reg}(\mathcal{F})$, our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy. We show that the cumulative regret of this procedure over a time horizon $n$ scales as $O(\sqrt{n \mathsf{Reg}(\mathcal{F})})$ against a benchmark with a multiplicative factor of $1/2$. On the other hand, using the techniques of Filmus and Ward (2014), we show that an $\epsilon$-Greedy procedure with local randomization attains regret of $O(n^{2/3} \mathsf{Reg}(\mathcal{F})^{1/3})$ against a stronger $(1-e^{-1})$ benchmark.
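For reference, a minimal sketch of the Inverse Gap Weighting rule in its standard form (the estimated reward $\hat{f}$, candidate set $\mathcal{A}$, locally optimal set $\hat{a}$, and learning-rate parameter $\gamma$ below are illustrative notation introduced here, not taken from the abstract):
\[
p(a) \;=\; \frac{1}{|\mathcal{A}| + \gamma\bigl(\hat{f}(\hat{a}) - \hat{f}(a)\bigr)} \quad \text{for } a \neq \hat{a},
\qquad
p(\hat{a}) \;=\; 1 - \sum_{a \neq \hat{a}} p(a),
\]
so that actions whose estimated reward gap to the local optimum is larger are selected with proportionally smaller probability, trading off exploration against exploitation through $\gamma$.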