We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is no longer a convex-concave saddle-point problem once nonlinearity in the parameterization of the optimization variables is introduced to ensure positivity, so no primal-dual algorithm is guaranteed to converge or to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence. Consequently, nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.
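As a point of reference, the estimation problem the abstract refers to can be sketched as follows; the symbols $d_\pi$, $d_\mu$, and $\tau$ are our notation for illustration and need not match the paper's:
\[
\tau(s) \;=\; \frac{d_\pi(s)}{d_\mu(s)}, \qquad \tau(s) \ge 0 \;\; \forall s, \qquad \mathbb{E}_{s \sim d_\mu}\big[\tau(s)\big] \;=\; \sum_{s} d_\mu(s)\,\tau(s) \;=\; 1,
\]
where $d_\pi$ is the state distribution induced by the target policy $\pi$ and $d_\mu$ is the sampling (behavior) distribution; the normalization follows because reweighting $d_\mu$ by $\tau$ must recover the probability distribution $d_\pi$. The positivity constraint $\tau \ge 0$ is what GenDICE enforces through a nonlinear parameterization (for instance, squaring or exponentiating a raw output), which is precisely the nonlinearity that breaks the convex-concave saddle-point structure described above.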