Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs

In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees, yet it is challenging because variances are often not known a priori. Zhang et al. (2021) recently made considerable progress, obtaining a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$, where $d$ is the dimension of the features, $K$ is the time horizon, $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence. This is a factor of $d^3$ improvement over their bound. For linear mixture MDPs, under the assumption that the maximum cumulative reward in an episode lies in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$, where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower-order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
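To give a feel for how the linear bandit bound behaves, the following minimal sketch evaluates the expression $\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2$ numerically. It is only an illustration under stated assumptions: constants and the polylogarithmic factors hidden by $\tilde O$ are dropped, and the function name `linear_bandit_bound` and the example values of $d$, $K$, and $\sigma_k^2$ are hypothetical choices, not taken from the paper.

```python
import math
from typing import Sequence


def linear_bandit_bound(d: int, noise_variances: Sequence[float]) -> float:
    """Evaluate min{d*sqrt(K), d^1.5*sqrt(sum_k sigma_k^2)} + d^2.

    Assumption: constants and polylogarithmic factors are ignored, so the
    result only indicates how the bound scales, not an actual regret value.
    """
    K = len(noise_variances)  # time horizon
    worst_case_term = d * math.sqrt(K)
    variance_term = d ** 1.5 * math.sqrt(sum(noise_variances))
    return min(worst_case_term, variance_term) + d ** 2


if __name__ == "__main__":
    d, K = 4, 100  # hypothetical dimension and horizon
    # Noiseless rewards: only the lower-order d^2 term remains.
    print(linear_bandit_bound(d, [0.0] * K))  # 16.0
    # Unit variance at every step: the d*sqrt(K) term is the smaller one.
    print(linear_bandit_bound(d, [1.0] * K))  # 56.0
```

With zero noise the expression collapses to the $d^2$ term, independent of $K$, while with unit variance at every step the $d\sqrt{K}$ branch of the minimum is the active one.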
