Enhancing the diversity of policies is beneficial for robustness, exploration, and transfer in reinforcement learning (RL). In this paper, we seek diverse policies in an under-explored setting: RL tasks with structured action spaces, which exhibit the two properties of composability and local dependencies. The complex action structure, non-uniform reward landscape, and subtle hyperparameter tuning induced by these properties prevent existing approaches from scaling well. We propose a simple and effective RL method, Diverse Policy Optimization (DPO), which models policies over structured action spaces as energy-based models (EBMs) under the probabilistic RL framework. A recently proposed powerful generative model, GFlowNet, serves as an efficient, diverse sampler for the EBM-based policy. DPO follows a joint optimization framework: the outer layer uses the diverse policies sampled by the GFlowNet to update the EBM-based policy, which in turn supports the GFlowNet training in the inner layer. Experiments on the ATSC and Battle benchmarks demonstrate that DPO efficiently discovers surprisingly diverse policies in challenging scenarios and substantially outperforms existing state-of-the-art methods.
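The joint optimization described above can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: the structured action space is reduced to pairs of sub-actions, the GFlowNet sampler is replaced by a simple Boltzmann sampler over the EBM's energies, and the reward function is an invented two-mode landscape; none of this reproduces the actual DPO algorithm or its hyperparameters.

```python
import math
import random

random.seed(0)

# Toy structured action space: composable pairs of sub-actions.
ACTIONS = [(i, j) for i in range(3) for j in range(3)]

# EBM-based policy: one learnable score per structured action
# (local dependencies between sub-actions are ignored for brevity).
theta = {a: 0.0 for a in ACTIONS}

def energy(a):
    # Lower energy corresponds to higher policy probability.
    return -theta[a]

def sample_policy(n=16):
    # Stand-in for the GFlowNet sampler: Boltzmann sampling
    # proportional to exp(-energy). DPO trains a GFlowNet here instead.
    weights = [math.exp(-energy(a)) for a in ACTIONS]
    total = sum(weights)
    return random.choices(ACTIONS, [w / total for w in weights], k=n)

def reward(a):
    # Hypothetical reward with two distinct modes, so a diverse
    # sampler should discover and retain both.
    return 1.0 if a in [(0, 0), (2, 2)] else 0.1

# Outer layer: push the EBM toward high-reward actions proposed by
# the sampler; the updated energies then reshape the sampler itself.
for step in range(200):
    for a in sample_policy():
        theta[a] += 0.05 * (reward(a) - 0.1)

top_two = sorted(ACTIONS, key=lambda a: theta[a], reverse=True)[:2]
```

After training, `top_two` contains both reward modes rather than collapsing onto one, which is the qualitative behavior the diverse EBM sampler is meant to provide.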