Sequential decision making in the real world often requires balancing conflicting objectives. In general, there exists a plethora of Pareto-optimal policies that embody different trade-offs between the objectives, and it is technically challenging to obtain them exhaustively with deep neural networks. In this work, we propose a novel multi-objective reinforcement learning (MORL) algorithm that trains a single neural network via policy gradient to approximately obtain the entire Pareto set in a single training run, without relying on linear scalarization of the objectives. The proposed method works in both continuous and discrete action spaces with no change to the design of the policy network. Numerical experiments in benchmark environments demonstrate the practicality and efficacy of our approach in comparison to standard MORL baselines.
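For concreteness, here is the standard MORL terminology the abstract relies on, written in our own notation (the paper's symbols may differ); $J_i(\pi)$ denotes the expected discounted return of policy $\pi$ under objective $i$ of $m$ objectives.

```latex
% A policy \pi Pareto-dominates \pi' iff it is at least as good on
% every objective and strictly better on at least one:
\[
  \pi \succ \pi' \iff
  J_i(\pi) \ge J_i(\pi') \;\; \forall i \in \{1,\dots,m\},
  \quad \text{and} \quad
  J_j(\pi) > J_j(\pi') \;\; \text{for some } j.
\]
% The Pareto set is the set of all non-dominated policies.
% Linear scalarization instead collapses the objectives with a
% fixed weight vector and solves a single-objective problem:
\[
  \max_{\pi} \; \sum_{i=1}^{m} w_i \, J_i(\pi),
  \qquad w_i \ge 0, \quad \sum_{i=1}^{m} w_i = 1.
\]
```

Sweeping the weight vector $w$ in the second formulation can recover only solutions on the convex part of the Pareto front, which is the usual motivation for methods that, like the one proposed here, avoid linear scalarization.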