We study the design of sample-efficient algorithms for reinforcement learning in the presence of rich, high-dimensional observations, formalized via the Block MDP problem. Existing algorithms suffer from either 1) computational intractability, 2) strong statistical assumptions that are not necessarily satisfied in practice, or 3) suboptimal sample complexity. We address these issues by providing the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level, with minimal statistical assumptions. Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics, a learning objective in which the aim is to predict the learner's own action from the current observation and observations in the (potentially distant) future. MusIK is simple and flexible, and can efficiently take advantage of general-purpose function approximation. Our analysis leverages several new techniques tailored to non-optimistic exploration algorithms, which we anticipate will find broader use.
Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL
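To make the multi-step inverse kinematics objective concrete, the following is a minimal sketch of the core prediction problem: given a batch of observations at time t, observations k steps later, and the actions taken at time t, compute the cross-entropy of a classifier that predicts the action from the (current, future) observation pair. The linear classifier, the identity encoder, and all variable names here are illustrative assumptions, not the paper's implementation (which uses general-purpose function approximation for the encoder).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    # Placeholder for the learned representation phi(x); the identity is
    # used here purely for illustration.
    return x

def mik_loss(obs_now, obs_future, actions, W, b):
    """Multi-step inverse kinematics loss (illustrative sketch).

    Cross-entropy of predicting the action a_t taken at time t from the
    pair (x_t, x_{t+k}), where x_{t+k} is a potentially distant future
    observation.

    obs_now, obs_future : (n, d) observation batches at times t and t+k
    actions             : (n,) integer actions taken at time t
    W, b                : (2d, A) weights and (A,) bias of a linear
                          action classifier over concatenated features
    """
    feats = np.concatenate([encode(obs_now), encode(obs_future)], axis=1)
    logits = feats @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the actions actually taken.
    return -log_probs[np.arange(len(actions)), actions].mean()

# Toy batch: 4 samples, 3-dimensional observations, 2 actions.
obs_now = rng.normal(size=(4, 3))
obs_future = rng.normal(size=(4, 3))
actions = np.array([0, 1, 1, 0])
W = rng.normal(size=(6, 2))
b = np.zeros(2)
loss = mik_loss(obs_now, obs_future, actions, W, b)
```

In the full algorithm this loss would be minimized over the encoder and classifier jointly, and the learned representation is then combined with systematic exploration; the sketch above only shows the shape of the supervised prediction problem.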