Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

We consider the task of estimating a structural model of dynamic decisions made by a human agent from the observable history of implemented actions and visited states. This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified, while in the outer problem, a measure of fit is maximized. Several approaches have been proposed to alleviate the computational burden of this nested-loop structure, but these methods still suffer from high complexity when the state space is either discrete with large cardinality or continuous in high dimensions. Other approaches in the inverse reinforcement learning (IRL) literature emphasize policy estimation at the expense of reward estimation accuracy. In this paper, we propose a single-loop estimation algorithm with finite-time guarantees that is equipped to deal with high-dimensional state spaces without compromising reward estimation accuracy. In the proposed algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. We show that the proposed algorithm converges to a stationary solution with a finite-time guarantee. Further, if the reward is parameterized linearly, we show that the algorithm approximates the maximum likelihood estimator at a sublinear rate. Finally, using robotics control problems in MuJoCo and their transfer settings, we show that the proposed algorithm achieves superior performance compared with other IRL and imitation learning benchmarks.
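The single-loop idea described above, one policy improvement step followed by one stochastic gradient step on the likelihood, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a small tabular MDP with a known transition tensor `P`, an entropy-regularized (softmax) policy, and a tabular reward `theta[s, a]`, for which the reward gradient reduces to discounted state-action visit counts. All function names (`soft_policy_step`, `rollout`, `discounted_counts`, `single_loop_irl`, `chain_mdp`) and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def logsumexp(q, axis=-1):
    # Numerically stable log-sum-exp along one axis.
    m = q.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(q - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def soft_policy_step(q, reward, P, gamma):
    """One soft Bellman backup, then a softmax policy improvement (assumed form)."""
    v = logsumexp(q, axis=1)                      # soft state values, shape (S,)
    q_new = reward + gamma * (P @ v)              # P has shape (S, A, S)
    pi = np.exp(q_new - logsumexp(q_new, axis=1)[:, None])
    return q_new, pi

def rollout(pi, P, horizon, rng, s0=0):
    """Sample one trajectory [(state, action), ...] by following policy pi."""
    traj, s = [], s0
    for _ in range(horizon):
        a = rng.choice(pi.shape[1], p=pi[s])
        traj.append((s, a))
        s = rng.choice(P.shape[2], p=P[s, a])
    return traj

def discounted_counts(traj, shape, gamma):
    """Discounted visit counts: sum_t gamma^t * grad_theta r for a tabular reward."""
    g = np.zeros(shape)
    for t, (s, a) in enumerate(traj):
        g[s, a] += gamma ** t
    return g

def single_loop_irl(expert_trajs, P, gamma=0.9, iters=200, lr=0.05, seed=0):
    """Alternate one policy step with one stochastic likelihood-gradient step."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    theta = np.zeros((S, A))                      # reward parameters
    q = np.zeros((S, A))                          # running soft Q estimate
    horizon = len(expert_trajs[0])
    for _ in range(iters):
        # Policy improvement under the current reward estimate (no inner loop).
        q, pi = soft_policy_step(q, theta, P, gamma)
        # Stochastic gradient: one expert trajectory versus one agent trajectory.
        expert = expert_trajs[rng.integers(len(expert_trajs))]
        agent = rollout(pi, P, horizon, rng)
        grad = (discounted_counts(expert, (S, A), gamma)
                - discounted_counts(agent, (S, A), gamma))
        theta += lr * grad
    return theta, pi

def chain_mdp(n_states=5):
    """Toy deterministic chain: action 0 moves left, action 1 moves right."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, n_states - 1)] = 1.0
    return P

if __name__ == "__main__":
    P = chain_mdp()
    rng = np.random.default_rng(1)
    expert_pi = np.tile([0.1, 0.9], (5, 1))       # hypothetical expert: mostly moves right
    demos = [rollout(expert_pi, P, 10, rng) for _ in range(20)]
    theta, pi = single_loop_irl(demos, P)
    print(pi)
```

The point of the sketch is that the policy is never solved to optimality for a given reward: `q` is carried across iterations and updated once per reward update, which is what removes the nested loop. In a high-dimensional setting, the tabular backup would presumably be replaced by a function-approximation policy update, which this toy example does not attempt.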
