Progress in deep reinforcement learning (RL) research is largely enabled by benchmark task environments. However, analyzing the nature of those environments is often overlooked. In particular, we still do not have agreed-upon ways to measure the difficulty or solvability of a task, given that each task has fundamentally different actions, observations, dynamics, and rewards, and can be tackled with diverse RL algorithms. In this work, we propose policy information capacity (PIC) -- the mutual information between policy parameters and episodic return -- and policy-optimal information capacity (POIC) -- the mutual information between policy parameters and episodic optimality -- as two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty. Evaluating our metrics across toy environments as well as continuous control benchmark tasks from OpenAI Gym and DeepMind Control Suite, we empirically demonstrate that these information-theoretic metrics correlate more strongly with normalized task solvability scores than a variety of alternatives. Lastly, we show that these metrics can also be used for fast, compute-efficient optimization of key design parameters such as reward shaping, policy architectures, and MDP properties, improving solvability by RL algorithms without ever running full RL experiments.
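To make the definitions above concrete, the following is a minimal sketch (not the paper's implementation) of how PIC and POIC could be estimated in practice: policy parameters θ are drawn from an assumed Gaussian prior, each sampled policy is rolled out for several episodes, and the mutual information is approximated with a histogram-based plug-in estimator, I(Θ; R) ≈ H(R) − E_θ[H(R | θ)]. The toy point-mass environment, the bin count, and the optimality threshold (a return quantile) are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable; accepts scalars or arrays."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def histogram_entropy(x, bins):
    """Plug-in entropy estimate (in nats) of a 1-D sample via histogram binning."""
    counts, _ = np.histogram(x, bins=bins)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return -(probs * np.log(probs)).sum()

def episodic_return(theta, rng, horizon=20):
    """Toy 1-D point-mass task (an illustrative stand-in for a benchmark environment):
    a linear policy u = theta[0] * x + theta[1] tries to drive the state x toward 0."""
    x = rng.normal()
    total = 0.0
    for _ in range(horizon):
        u = theta[0] * x + theta[1]
        x = x + 0.1 * u + 0.05 * rng.normal()  # noisy dynamics
        total += -x ** 2                       # reward: negative squared distance to 0
    return total

def pic_and_poic(n_params=200, n_episodes=16, n_bins=20, optimality_quantile=0.9, seed=0):
    """Estimate PIC = I(Theta; R) and POIC = I(Theta; O) with the plug-in estimator
    I(Theta; R) ~= H(R) - E_theta[H(R | theta)], where theta is drawn from a prior."""
    rng = np.random.default_rng(seed)
    returns = np.empty((n_params, n_episodes))
    for i in range(n_params):
        theta = rng.normal(scale=1.0, size=2)  # policy-parameter prior (assumed Gaussian)
        for j in range(n_episodes):
            returns[i, j] = episodic_return(theta, rng)

    # Shared bin edges keep the marginal and conditional entropy estimates comparable.
    edges = np.histogram_bin_edges(returns, bins=n_bins)
    h_r = histogram_entropy(returns.ravel(), edges)
    h_r_given_theta = np.mean([histogram_entropy(returns[i], edges) for i in range(n_params)])
    pic = h_r - h_r_given_theta

    # POIC: replace returns with a binary optimality indicator O = 1[R >= threshold].
    # The threshold here is an assumed return quantile, chosen for illustration only.
    threshold = np.quantile(returns, optimality_quantile)
    opt = (returns >= threshold).astype(float)
    h_o = binary_entropy(opt.mean())
    h_o_given_theta = np.mean(binary_entropy(opt.mean(axis=1)))
    poic = h_o - h_o_given_theta
    return pic, poic

if __name__ == "__main__":
    pic, poic = pic_and_poic()
    print(f"PIC  (nats): {pic:.3f}")
    print(f"POIC (nats): {poic:.3f}")
```

In the same spirit, `episodic_return` could be swapped for rollouts in an OpenAI Gym or DeepMind Control Suite task, which only requires evaluating sampled policies and never running a full RL training loop.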