Recently, several studies (Zhou et al., 2021a; Zhang et al., 2021b; Kim et al., 2021; Zhou and Gu, 2022) have provided variance-dependent regret bounds for linear contextual bandits, which interpolate between the regret in the worst-case regime and that in the deterministic-reward regime. However, these algorithms are either computationally intractable or unable to handle unknown noise variance. In this paper, we present a novel solution to this open problem by proposing the first computationally efficient algorithm for linear bandits with heteroscedastic noise. Our algorithm adapts to the unknown variance of the noise and achieves an $\tilde{O}(d \sqrt{\sum_{k = 1}^K \sigma_k^2} + d)$ regret, where $\sigma_k^2$ is the variance of the noise in round $k$, $d$ is the dimension of the contexts, and $K$ is the total number of rounds. Our results are based on an adaptive, variance-aware confidence set enabled by a new Freedman-type concentration inequality for self-normalized martingales, together with a multi-layer structure that stratifies the context vectors into layers with different uniform upper bounds on the uncertainty. Furthermore, our approach extends to linear mixture Markov decision processes (MDPs) in reinforcement learning. We propose a variance-adaptive algorithm for linear mixture MDPs, which achieves a problem-dependent horizon-free regret bound that gracefully reduces to a nearly constant regret for deterministic MDPs. Unlike existing nearly minimax-optimal algorithms for linear mixture MDPs, our algorithm does not require explicit variance estimation of the transition probabilities or the use of high-order moment estimators to attain horizon-free regret. We believe the techniques developed in this paper are of independent value for general online decision-making problems.
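As a minimal sketch of the kind of construction this line of work relies on (illustrative only, not necessarily the exact estimator analyzed in this paper), variance-aware linear bandit algorithms typically maintain a weighted ridge-regression estimate of the unknown parameter together with a confidence ellipsoid around it, where the weights downweight rounds with large (estimated or surrogate) noise variance:
\[
\boldsymbol{\Sigma}_k = \lambda \mathbf{I} + \sum_{i=1}^{k-1} \bar{\sigma}_i^{-2}\, \mathbf{x}_i \mathbf{x}_i^\top,
\qquad
\hat{\boldsymbol{\theta}}_k = \boldsymbol{\Sigma}_k^{-1} \sum_{i=1}^{k-1} \bar{\sigma}_i^{-2}\, r_i\, \mathbf{x}_i,
\qquad
\mathcal{C}_k = \big\{\boldsymbol{\theta} : \|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_k\|_{\boldsymbol{\Sigma}_k} \le \hat{\beta}_k\big\},
\]
where $\lambda$, the variance surrogates $\bar{\sigma}_i^2$, and the confidence radius $\hat{\beta}_k$ are placeholders whose precise choices differ across algorithms. In this picture, the multi-layer structure mentioned above can be thought of as maintaining several such estimators in parallel, each layer collecting context vectors whose uncertainty $\|\mathbf{x}_i\|_{\boldsymbol{\Sigma}^{-1}}$ falls in a common range, so that a uniform upper bound applies within each layer.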