Recently, many algorithms have been devised for reinforcement learning (RL) with function approximation. While they have clear algorithmic distinctions, they also have many implementation differences that are algorithm-independent and sometimes under-emphasized. Such mixing of algorithmic novelty and implementation craftsmanship makes rigorous analysis of the sources of performance improvements across algorithms difficult. In this work, we focus on a series of off-policy inference-based actor-critic algorithms -- MPO, AWR, and SAC -- to decouple their algorithmic innovations from their implementation decisions. We present unified derivations through a single control-as-inference objective, under which each algorithm can be categorized as based on either Expectation-Maximization (EM) or direct Kullback-Leibler (KL) divergence minimization, and the remaining specifications are treated as implementation details. We performed extensive ablation studies and identified substantial performance drops whenever implementation details are mismatched with algorithmic choices. These results show which implementation details are co-adapted and co-evolved with algorithms, and which are transferable across algorithms: as examples, we identified that the tanh-Gaussian policy and network sizes are highly adapted to the algorithmic type, while layer normalization and ELU activations are critical for MPO's performance but also transfer to noticeable gains in SAC. We hope our work can inspire future efforts to further demystify the sources of performance improvements across algorithms and allow researchers to build on one another's algorithmic and implementation innovations.
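As a concrete illustration of the kinds of implementation details discussed above, the following is a minimal sketch (assuming PyTorch; the class, layer sizes, and clamping bounds are illustrative assumptions, not taken from the paper) of a tanh-squashed Gaussian policy whose MLP trunk uses layer normalization and ELU activations.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class TanhGaussianPolicy(nn.Module):
    """Illustrative tanh-Gaussian policy with a LayerNorm + ELU trunk."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        # MLP trunk with layer normalization and ELU activations,
        # two of the implementation details ablated across MPO/AWR/SAC.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ELU(),
        )
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-5.0, 2.0)  # keep std in a sane range
        return mean, log_std.exp()

    def sample(self, obs):
        mean, std = self(obs)
        dist = Normal(mean, std)
        pre_tanh = dist.rsample()       # reparameterized Gaussian sample
        action = torch.tanh(pre_tanh)   # squash into the bounded action space
        # Change-of-variables correction for the tanh squashing.
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)
```

Whether the tanh squashing, the network width, or the normalization layers are used is exactly the kind of algorithm-independent choice whose transferability the ablation studies examine.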