Probabilistic dynamics model ensembles are widely used in existing model-based reinforcement learning methods, as they outperform a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights into the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that the stronger the Lipschitz condition on a value function, the smaller the gap between the Bellman operators induced by the true and the learned dynamics, and hence the closer the converged value function is to the optimal value function. We therefore hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms, computing adversarial noise and regularizing the value network's spectral norm, that directly regularize the Lipschitz condition of the value function. Empirical results show that, combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight but also provide a practical solution for developing computationally efficient model-based RL algorithms.
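The two mechanisms above admit a short illustration. The following is a minimal PyTorch sketch, not the paper's actual implementation: `ValueNet`, `lipschitz_penalty`, `eps`, and `lam` are hypothetical names, spectral normalization is applied via `torch.nn.utils.spectral_norm`, and an FGSM-style perturbation stands in for one plausible way to compute the adversarial noise.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Value network whose per-layer spectral norms are constrained to ~1,
    bounding the overall Lipschitz constant of V (hypothetical architecture)."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(state_dim, hidden)),
            nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, hidden)),
            nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def lipschitz_penalty(value_net: nn.Module, states: torch.Tensor,
                      eps: float = 0.1) -> torch.Tensor:
    """Adversarial-noise regularizer: take an FGSM-style step in the state
    direction that most changes V, then penalize the induced value change."""
    states = states.detach().requires_grad_(True)
    grad = torch.autograd.grad(value_net(states).sum(), states)[0]
    noise = eps * grad.sign()  # worst-case perturbation within an L-inf ball
    clean = value_net(states.detach())
    perturbed = value_net(states.detach() + noise.detach())
    return ((perturbed - clean) ** 2).mean()

# Hypothetical usage in a critic update on model-generated states:
# loss = bellman_loss + lam * lipschitz_penalty(value_net, model_states)
```

In such a setup, the penalty would be added to the critic's Bellman loss on model-generated states, with `lam` trading off value accuracy against smoothness of the value function.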