Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult because of its long planning horizon and unknown state-dependent transitions. This paper shows that the long planning horizon and the unknown state-dependent transitions pose at most little additional difficulty in terms of sample complexity. We consider episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, and total reward bounded by $1$, in which the agent plays for $K$ episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue \textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter because it is based on a well-designed monotonic value function. In particular, the \emph{constants} in the bonus must be set carefully to ensure both optimism and monotonicity. We show that MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms. Notably, this result 1) improves \emph{exponentially} on the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) improves \emph{exponentially} on the running time in [Wang et al. 2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.
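
To make the notion of a Bernstein-type bonus concrete, the following is a minimal sketch of the generic form such a bonus usually takes: a variance-dependent term plus a lower-order $1/n$ term. It is not the MVP bonus itself. The function name `bernstein_bonus`, its parameters, and the constants `c1` and `c2` are illustrative assumptions, and the specific constants that MVP needs for optimism and monotonicity are not reproduced here.

```python
import math


def bernstein_bonus(p_hat, v_next, n, log_term, c1=1.0, c2=1.0):
    """Generic Bernstein-type exploration bonus for one (state, action) pair.

    p_hat    : empirical next-state distribution (list of probabilities)
    v_next   : current value estimates for the next states
    n        : visit count of the (state, action) pair
    log_term : logarithmic confidence factor, e.g. log(S*A*H*K/delta)
    c1, c2   : placeholder constants (assumed, NOT the paper's values)
    """
    n = max(n, 1)  # avoid division by zero for unvisited pairs
    # Empirical mean and variance of the next-state value under p_hat.
    mean = sum(p * v for p, v in zip(p_hat, v_next))
    var = sum(p * (v - mean) ** 2 for p, v in zip(p_hat, v_next))
    # Variance-dependent leading term plus a lower-order 1/n term.
    return c1 * math.sqrt(var * log_term / n) + c2 * log_term / n


if __name__ == "__main__":
    p_hat = [0.5, 0.5]
    v_next = [0.0, 1.0]
    for n in (1, 10, 100):
        print(n, bernstein_bonus(p_hat, v_next, n, log_term=1.0))
```

The bonus shrinks as the visit count grows, and it is smaller when the next-state values have low empirical variance, which is what allows variance-aware bonuses to be tighter than Hoeffding-style ones.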