One's ability to learn a generative model of the world without supervision depends on the extent to which one can construct abstract knowledge representations that generalize across experiences. To this end, capturing accurate statistical structure from observational data provides useful inductive biases that can be transferred to novel environments. Here, we tackle the problem of learning to control dynamical systems with Bayesian nonparametric methods, and we apply the approach to visual servoing tasks. This is accomplished by first learning a state-space representation, then inferring the environmental dynamics and improving the policy through imagined future trajectories. Bayesian nonparametric models provide automatic model adaptation, which not only combats underfitting and overfitting but also allows the model, despite its unbounded dimension, to remain both flexible and computationally tractable. By employing Gaussian processes to discover latent world dynamics, we mitigate the data-efficiency issues commonly observed in reinforcement learning and avoid introducing explicit model bias when describing the system's dynamics. Our algorithm jointly learns a world model and a policy by optimizing a variational lower bound on the log-likelihood with respect to an expected free energy minimization objective. Finally, we compare the performance of our model with state-of-the-art alternatives on continuous control tasks in simulated environments.
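To make the idea of learning dynamics with a Gaussian process and then rolling out imagined trajectories concrete, the following is a minimal sketch, not the algorithm summarized above. It assumes a fully observed one-dimensional toy system (the hypothetical `true_step`), exact GP regression with a fixed squared-exponential kernel, and a hand-written linear policy. The abstract's method instead learns a latent state space from visual input and optimizes a variational bound, none of which is reproduced here. All function names and hyperparameters are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-2, variance=1.0):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    k_tt = rbf_kernel(x_train, x_train, variance=variance)
    k_tt = k_tt + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test, variance=variance)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_ts)
    mean = k_ts.T @ alpha
    var = variance - np.sum(v**2, axis=0)
    return mean, var


def true_step(state, action):
    """Hypothetical toy dynamics, used only to generate training data."""
    return state + 0.1 * np.sin(state) + 0.1 * action


# Collect random transitions and regress the state change on (state, action).
rng = np.random.default_rng(0)
states = rng.uniform(-3.0, 3.0, size=(50, 1))
actions = rng.uniform(-1.0, 1.0, size=(50, 1))
inputs = np.hstack([states, actions])
targets = (true_step(states, actions) - states).ravel()

train_mean, train_var = gp_posterior(inputs, targets, inputs)


def imagine_rollout(start_state, policy, horizon):
    """Roll the GP mean forward under a policy, without touching the real system."""
    trajectory = [float(start_state)]
    for _ in range(horizon):
        s = trajectory[-1]
        query = np.array([[s, policy(s)]])
        delta_mean, _ = gp_posterior(inputs, targets, query)
        trajectory.append(s + float(delta_mean[0]))
    return np.array(trajectory)


# Assumed linear feedback policy, chosen only for illustration.
rollout = imagine_rollout(0.5, policy=lambda s: -0.5 * s, horizon=10)
```

The rollout propagates only the posterior mean. A fuller treatment would also propagate the predictive variance through time, which is what lets a GP-based planner account for model uncertainty rather than trusting a single point estimate.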