Many real-world continuous control problems face the dilemma of weighing the pros and cons of multiple objectives, and multi-objective reinforcement learning (MORL) serves as a generic framework for learning control policies under different preferences over objectives. However, existing MORL methods either rely on multiple passes of explicit search to find the Pareto front, and are therefore not sample-efficient, or utilize a shared policy network that allows only coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction, thereby enabling data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee. To substantiate this concept, we propose the technique of the Q replay buffer, which stores the learned Q-networks from past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that, with far fewer samples, the proposed algorithm can outperform benchmark MORL methods on a variety of MORL benchmark tasks.
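To make the Q replay buffer idea concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a vector-valued Q-network, a snapshot buffer of bounded capacity, and an illustrative combination rule that scalarizes each snapshot with the current preference weights and takes the elementwise maximum across snapshots. The names `QReplayBuffer` and `best_q_value`, the network sizes, and the max-over-snapshots rule are all assumptions for illustration.

```python
# Sketch of a Q replay buffer: store frozen copies of past Q-networks and
# combine them when computing the policy update target. Illustrative only.
import copy
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Multi-objective Q-network: maps (state, action) to a vector of
    per-objective Q-values."""

    def __init__(self, state_dim: int, action_dim: int, num_objectives: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_objectives),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


class QReplayBuffer:
    """Keeps frozen snapshots of Q-networks from past iterations."""

    def __init__(self, capacity: int = 5):
        self.snapshots = deque(maxlen=capacity)

    def push(self, q_network: QNetwork) -> None:
        # Store a detached copy so later critic updates do not alter it.
        snapshot = copy.deepcopy(q_network)
        for p in snapshot.parameters():
            p.requires_grad_(False)
        self.snapshots.append(snapshot)

    def best_q_value(self, state: torch.Tensor, action: torch.Tensor,
                     preference: torch.Tensor) -> torch.Tensor:
        # Scalarize each snapshot's Q-vector with the preference weights,
        # then take the elementwise max across snapshots (assumed rule).
        scalarized = torch.stack(
            [(q(state, action) * preference).sum(dim=-1) for q in self.snapshots]
        )
        return scalarized.max(dim=0).values
```

In this sketch, the actor update for a given preference would use the larger of the current critic's scalarized value and `best_q_value(...)`, so that Q-snapshots learned under other preferences or at earlier iterations can still shape the policy update direction, which is one way to realize the policy-level data sharing described in the abstract.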