We consider the reinforcement learning (RL) setting, in which the agent has to act in an unknown environment driven by a Markov Decision Process (MDP) with sparse or even reward-free signals. In this situation, exploration becomes the main challenge. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization, previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose an algorithm based on a game-theoretic representation that has $\widetilde{\mathcal{O}}(H^3 S^2 A / \varepsilon^2)$ sample complexity, thus improving the $\varepsilon$-dependence of Hazan et al. (2019), where $S$ is the number of states, $A$ is the number of actions, $H$ is the episode length, and $\varepsilon$ is the desired accuracy. The second type of entropy we study is the trajectory entropy. This objective function is closely related to entropy-regularized MDPs, and we propose a simple modification of the UCBVI algorithm that has a sample complexity of order $\widetilde{\mathcal{O}}(1/\varepsilon)$, ignoring the dependence on $S$, $A$, and $H$. Interestingly, this is the first theoretical result in the RL literature establishing that the exploration problem for regularized MDPs can be statistically strictly easier (in terms of sample complexity) than for ordinary MDPs.
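For concreteness, a minimal sketch of the two objectives, with notation assumed for illustration rather than quoted from the body of the paper: the first objective maximizes the entropy of the state visitation distribution induced by a policy, while the second maximizes the entropy of the distribution over whole trajectories,
$$
\max_{\pi} \; \mathcal{H}(d^{\pi}), \quad \mathcal{H}(d^{\pi}) = -\sum_{s} d^{\pi}(s) \log d^{\pi}(s),
\qquad \text{and} \qquad
\max_{\pi} \; \mathcal{H}(q^{\pi}), \quad \mathcal{H}(q^{\pi}) = -\sum_{\tau} q^{\pi}(\tau) \log q^{\pi}(\tau),
$$
where $d^{\pi}$ denotes the state visitation distribution (e.g., averaged over steps $h \le H$) and $q^{\pi}$ denotes the law of a trajectory $\tau = (s_1, a_1, \dots, s_H, a_H)$ under the policy $\pi$.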