Many algorithms have recently been devised for reinforcement learning (RL) with function approximation. While they have clear algorithmic distinctions, they also differ in many implementation details that are algorithm-independent and often under-emphasized. This mixing of algorithmic novelty and implementation craftsmanship makes it difficult to rigorously analyze the sources of performance improvements across algorithms. In this work, we focus on a series of off-policy, inference-based actor-critic algorithms -- MPO, AWR, and SAC -- and decouple their algorithmic innovations from their implementation decisions. We present unified derivations from a single control-as-inference objective, under which each algorithm can be categorized as based on either Expectation-Maximization (EM) or direct Kullback-Leibler (KL) divergence minimization, with the remaining specifications treated as implementation details. Through extensive ablation studies, we identify substantial performance drops whenever implementation details are mismatched with algorithmic choices. These results show which implementation or code-level details are co-adapted and co-evolved with particular algorithms, and which transfer across algorithms: for example, we find that the tanh-Gaussian policy and network sizes are highly adapted to the algorithmic type, while layer normalization and ELU activations are critical for MPO's performance and also transfer to noticeable gains in SAC. We hope our work inspires further efforts to demystify the sources of performance improvements across algorithms and allows researchers to build on one another's algorithmic and implementational innovations.
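To make the categorization concrete, the following is a minimal sketch, in our own notation rather than necessarily the paper's, of the kind of KL-regularized control-as-inference objective such derivations typically start from; here q is a variational action distribution, \pi the parametric policy, Q^{\pi} the critic, \mu a state distribution, and \alpha > 0 a temperature:

\[
  \mathcal{J}(q, \pi) \;=\;
  \mathbb{E}_{s \sim \mu,\; a \sim q(\cdot \mid s)}\!\big[ Q^{\pi}(s, a) \big]
  \;-\; \alpha\, \mathbb{E}_{s \sim \mu}\!\Big[ D_{\mathrm{KL}}\big( q(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \big) \Big].
\]

Under this reading, EM-style methods (e.g., MPO and AWR) alternate an E-step that improves q under the objective with an M-step that fits \pi to q, whereas direct-KL methods (e.g., SAC) optimize the policy itself against an entropy- or KL-regularized target.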