Two popular approaches to model-free continuous control tasks are SAC and TD3. At first glance these approaches seem rather different: SAC aims to solve the entropy-augmented MDP by minimising the KL-divergence between a stochastic proposal policy and a hypothetical energy-based soft Q-function policy, whereas TD3 is derived from DPG, which uses a deterministic policy to perform policy gradient ascent along the value function. In reality, both approaches are remarkably similar, and belong to a family of approaches we call `Off-Policy Continuous Generalized Policy Iteration'. This illuminates their similar performance on most continuous control benchmarks; indeed, when hyperparameters are matched, their performance can be statistically indistinguishable. To further remove any differences due to implementation, we provide OffCon$^3$ (Off-Policy Continuous Control: Consolidated), a code base featuring state-of-the-art versions of both algorithms.
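For concreteness, the two policy-improvement steps referred to above are commonly written as follows (a standard-notation sketch, not quoted from this abstract: $\alpha$ denotes the SAC temperature, $Z$ a state-dependent normaliser, $\mu_\theta$ the deterministic DPG/TD3 policy, and $\rho$ the off-policy state distribution):

$$\pi_{\mathrm{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\big(Q^{\pi_{\mathrm{old}}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right), \qquad \nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a) \big|_{a = \mu_\theta(s)} \right].$$

Both updates push the actor towards actions with high critic value using off-policy data, which is the shared structure the abstract summarises as Off-Policy Continuous Generalized Policy Iteration.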