In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it induces. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. In particular, we recast the objective to target the expected entropy of the state visitations induced in a single trial. We then show that the class of non-Markovian deterministic policies is sufficient for this objective, whereas Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way, and how non-Markovian exploration could benefit the sample efficiency of online reinforcement learning in future work.
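To make the shift in objective concrete, it can be sketched as follows, using illustrative notation introduced here only for exposition: $d^\pi$ denotes the expected state distribution induced by policy $\pi$ over a $T$-step episode, and $\hat{d}_T$ denotes the empirical distribution of states visited in a single $T$-step trajectory:

\[
\max_{\pi} \; H\!\left(d^\pi\right)
\quad \longrightarrow \quad
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\, H\!\left(\hat{d}_T\right) \right],
\]

i.e., the entropy of the expected visitations is replaced by the expected entropy of the visitations realized in a single trial.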