We propose a model-free reinforcement learning algorithm inspired by the popular randomized least-squares value iteration (RLSVI) algorithm, as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_E H)\sqrt{T})$, where $T$ is the time elapsed, $H$ is the planning horizon, and $d_E$ is the \emph{eluder dimension} of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, which enjoys an $\widetilde{O}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
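To make the perturbation idea concrete, the following is a minimal sketch of a single perturbed least-squares update in the linear setting, in the spirit of the exploration mechanism described above. All names here (`perturbed_lsvi_step`, `noise_std`, `num_samples`, `lam`) are illustrative choices, not the paper's notation, and the exact noise scale, regularizer perturbation, and number of perturbed solutions are assumptions; taking an elementwise maximum over several independently perturbed solutions is one way an optimistic reward sampling step could be realized.

```python
import numpy as np


def perturbed_lsvi_step(Phi, targets, lam=1.0, noise_std=1.0, num_samples=10, rng=None):
    """One perturbed least-squares value-iteration update (illustrative sketch).

    Phi:      (n, d) feature matrix of visited state-action pairs.
    targets:  (n,) regression targets, e.g. r + max_a' Q_{h+1}(s', a').
    Returns:  (num_samples, d) weight vectors from independently perturbed
              ridge regressions; acting greedily w.r.t. their elementwise
              maximum Q-value gives a heuristically optimistic policy.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d)  # regularized Gram matrix (shared across samples)
    ws = []
    for _ in range(num_samples):
        noise = noise_std * rng.standard_normal(n)  # i.i.d. scalar perturbations of the targets
        prior = noise_std * rng.standard_normal(d)  # perturbation of the regularization term
        w = np.linalg.solve(A, Phi.T @ (targets + noise) + np.sqrt(lam) * prior)
        ws.append(w)
    return np.stack(ws)


# Toy usage: 50 transitions with 4-dimensional features.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 4))
targets = rng.standard_normal(50)
W = perturbed_lsvi_step(Phi, targets, rng=rng)
q_optimistic = (Phi @ W.T).max(axis=1)  # max over perturbed estimates at the visited pairs
```

The key point the sketch illustrates is that exploration requires only re-solving a ridge regression on noise-perturbed data, rather than constructing an explicit UCB-style confidence bonus.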