In this paper, we revisit variational intrinsic control (VIC), an unsupervised reinforcement learning method for finding the largest set of intrinsic options available to an agent. In the original work by Gregor et al. (2016), two VIC algorithms were proposed: one that represents the options explicitly, and another that does so implicitly. We show that the intrinsic reward used in the latter is subject to bias in stochastic environments, causing convergence to suboptimal solutions. To correct this behavior and achieve maximal empowerment, we propose two methods, one based on a transition probability model and the other on a Gaussian mixture model. We substantiate our claims through rigorous mathematical derivations and experimental analyses.