Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Standard dynamics models for continuous control use feedforward computation to predict the conditional distribution of the next state and reward given the current state and action, using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action, and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially, conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization, both as a way to enrich the replay buffer through data augmentation and as a way to improve performance using model-based planning.
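
The contrast between the two model families can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the paper's architecture: `toy_cond` is a hypothetical stand-in for a learned conditional network, the fixed standard deviation of 0.1 is arbitrary, and the output vector is assumed to be the next-state dimensions followed by the reward. The feedforward sampler draws all dimensions independently from a diagonal Gaussian, while the autoregressive sampler draws one dimension at a time, conditioning each on the dimensions already generated. The log-likelihood follows from the chain rule as a sum of per-dimension conditional log-densities.

```python
import numpy as np


def sample_feedforward(mean, std, rng):
    """Diagonal Gaussian: every output dimension is drawn independently."""
    return mean + std * rng.standard_normal(mean.shape)


def toy_cond(context, prev, i):
    """Hypothetical conditional for dimension i (an assumption for illustration).

    The mean depends on the context (state and action) and on the
    previously generated dimensions; the std is fixed at 0.1.
    """
    mean = context.sum() + prev.sum()
    return mean, 0.1


def sample_autoregressive(cond_fn, context, dim, rng):
    """Draw dimensions sequentially; dimension i sees dimensions 0..i-1."""
    out = np.zeros(dim)
    for i in range(dim):
        mean_i, std_i = cond_fn(context, out[:i], i)
        out[i] = mean_i + std_i * rng.standard_normal()
    return out


def log_prob_autoregressive(cond_fn, context, target):
    """Chain rule: sum of per-dimension conditional Gaussian log-densities."""
    total = 0.0
    for i in range(len(target)):
        mean_i, std_i = cond_fn(context, target[:i], i)
        z = (target[i] - mean_i) / std_i
        total += -0.5 * z**2 - np.log(std_i) - 0.5 * np.log(2 * np.pi)
    return total


rng = np.random.default_rng(0)
context = np.array([0.5, -0.2])  # toy state and action, concatenated
sample = sample_autoregressive(toy_cond, context, dim=3, rng=rng)
held_out_ll = log_prob_autoregressive(toy_cond, context, sample)
```

In this sketch the last entry of `sample` plays the role of the reward. Evaluating `log_prob_autoregressive` on held-out transitions is the kind of comparison the abstract refers to when it reports log-likelihood results.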
