We study how representation learning can improve the efficiency of bandit problems. We consider a setting where we play $T$ linear bandits of dimension $d$ concurrently, and these $T$ bandit tasks share a common $k$ ($\ll d$)-dimensional linear representation. For the finite-action setting, we present a new algorithm that achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit. When $T$ is sufficiently large, our algorithm significantly outperforms the naive algorithm (playing the $T$ bandits independently), which achieves $\widetilde{O}(T\sqrt{dN})$ regret. We also provide an $\Omega(T\sqrt{kN} + \sqrt{dkNT})$ regret lower bound, showing that our algorithm is minimax-optimal up to poly-logarithmic factors. Furthermore, we extend our algorithm to the infinite-action setting and obtain a corresponding regret bound that demonstrates the benefit of representation learning in certain regimes. Finally, we present experiments on synthetic and real-world data to illustrate our theoretical findings and demonstrate the effectiveness of our proposed algorithms.
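As a quick sanity check (this derivation is ours and not part of the original abstract), dividing the multi-task bound by the naive per-task bound makes the improvement regime explicit; here $\widetilde{O}$ hides poly-logarithmic factors, as above:
$$
\frac{T\sqrt{kN} + \sqrt{dkNT}}{T\sqrt{dN}} = \sqrt{\frac{k}{d}} + \sqrt{\frac{k}{T}},
$$
which is $\ll 1$ precisely when $k \ll d$ and $k \ll T$. In particular, once $T \ge d$, the first term $T\sqrt{kN}$ dominates the new bound, so the multi-task algorithm saves a factor of roughly $\sqrt{d/k}$ in regret over playing the bandits independently.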