We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.
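As a rough illustration of how the rank of the learned features could be tracked, the sketch below uses an effective-rank measure over the singular values of the penultimate-layer feature matrix of the value network; the threshold `delta`, the function name, and the exact formulation are illustrative assumptions, not necessarily the paper's definition.

```python
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """Effective rank of a feature matrix: the smallest number of singular
    values whose cumulative sum captures a (1 - delta) fraction of the
    total spectrum. `features` is a (num_states, feature_dim) matrix of
    penultimate-layer activations of the value network (assumed layout).
    """
    singular_values = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Usage: evaluate effective_rank(phi) on a fixed batch of states at regular
# intervals during training; a steady decline is one way to surface the
# rank collapse described above.
```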