Despite the significant interest and progress in reinforcement learning (RL) problems with adversarial corruption, current works are either confined to the linear setting or lead to an undesired $\tilde{O}(\sqrt{T}\zeta)$ regret bound, where $T$ is the number of rounds and $\zeta$ is the total amount of corruption. In this paper, we consider the contextual bandit problem with general function approximation and propose a computationally efficient algorithm that achieves a regret of $\tilde{O}(\sqrt{T}+\zeta)$. The proposed algorithm builds on the uncertainty-weighted least-squares regression recently developed for linear contextual bandits \citep{he2022nearly} and a new weighted estimator of uncertainty for general function classes. In contrast to existing analyses that rely heavily on the linear structure, we develop a novel technique to control the sum of weighted uncertainties, which yields the final regret bounds. We then extend our algorithm to the episodic MDP setting and obtain, for the first time, an additive dependence on the corruption level $\zeta$ under general function approximation. Notably, our regret bounds either nearly match the performance lower bound or improve upon existing methods for all corruption levels, in both the known and unknown $\zeta$ cases.
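As a rough illustrative sketch (our notation; the symbols $\mathcal{F}$, $w_s$, $\sigma_s$, and $\alpha$ are assumptions for exposition, and the precise construction appears in the paper body), uncertainty-weighted least-squares regression over a general function class $\mathcal{F}$ can be written as
\begin{equation*}
\hat{f}_t \;=\; \operatorname*{argmin}_{f \in \mathcal{F}} \;\sum_{s=1}^{t-1} w_s \bigl( f(x_s, a_s) - r_s \bigr)^2,
\qquad
w_s \;\propto\; \min\Bigl\{1,\; \frac{\alpha}{\sigma_s}\Bigr\},
\end{equation*}
where $\sigma_s$ is an estimate of the uncertainty of the function class at $(x_s, a_s)$ and $\alpha$ is a threshold parameter, so that rounds with high uncertainty (which an adversary can corrupt most effectively) receive smaller weights. In the linear special case, $f(x, a) = \langle \theta, \phi(x, a) \rangle$ and the weights reduce to the elliptical-norm-based weighting of \citet{he2022nearly}.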