Stochastic Contextual Bandits with Long Horizon Rewards

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. To avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{q, T\})$ and $\tilde O(\sqrt{sdT})$, where $s$ is the sparsity, $d$ the feature dimension, $T$ the total time horizon, and $q$ a quantity that adapts to the reward dependence pattern. Complementing these upper bounds, we also show that learning over a single trajectory brings inherent challenges: while the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds, and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
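
As a reading aid, the reward structure described above can be sketched in symbols. The notation here is an assumption and is not taken from the abstract: $x_t \in \mathbb{R}^d$ denotes the feature vector of the context and action chosen at round $t$, $\theta \in \mathbb{R}^d$ the arm parameters, $w \in \mathbb{R}^h$ the dependence pattern, and $\varepsilon_t$ zero-mean noise. Whether the current round ($i=0$) is counted among the $s$ active lags is likewise an assumption of this sketch.

$$
r_t \;=\; \sum_{i=0}^{h-1} w_i \,\langle x_{t-i}, \theta\rangle + \varepsilon_t
\;=\; \langle w\theta^\top, X_t\rangle + \varepsilon_t,
\qquad \|w\|_0 \le s,
\qquad X_t = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-h+1} \end{bmatrix}^\top \in \mathbb{R}^{h\times d}.
$$

Under this reading, the unknown $w\theta^\top$ is a rank-1 matrix with at most $s$ nonzero rows, so it has on the order of $sd$ free entries rather than $hd$. Consecutive matrices $X_t$ and $X_{t+1}$ along a single trajectory share all but one row, shifted by one position, which is one way to see where the circulant structure mentioned in the abstract comes from.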
