Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their responses (estimates or actions). This paper considers IRL when noisy estimates of the gradient of a reward function, generated by multiple stochastic gradient agents, are observed. We present a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$. The proposed IRL algorithms use kernel-based passive learning schemes. We also construct multi-kernel passive Langevin algorithms for IRL that are suitable for high-dimensional data. The performance of the proposed IRL algorithms is illustrated on examples in adaptive Bayesian learning, logistic regression (a high-dimensional problem), and constrained Markov decision processes. We prove weak convergence of the proposed IRL algorithms using martingale averaging methods. We also analyze the tracking performance of the IRL algorithms in non-stationary environments where the reward function $R(\theta)$ jump changes over time according to a slow Markov chain.
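To fix ideas, the displayed equation below is a minimal sketch of the classical (non-passive) Langevin update that underlies such sampling schemes; it is shown only for orientation and is not the kernel-weighted passive variant developed in the paper. With step size $\varepsilon > 0$, a noisy gradient estimate $\widehat{\nabla R}(\theta_k)$, and injected noise $v_k \sim \mathsf{N}(0, I)$, the update is
\[
\theta_{k+1} = \theta_k + \frac{\varepsilon}{2}\,\widehat{\nabla R}(\theta_k) + \sqrt{\varepsilon}\, v_k ,
\]
which is the Euler discretization of the diffusion $d\theta_t = \tfrac{1}{2}\nabla R(\theta_t)\,dt + dW_t$, whose stationary density (when $\exp(R)$ is integrable) is proportional to $\exp(R(\theta))$. The passive algorithms of the paper replace the actively chosen evaluation point by kernel weighting of gradient estimates observed at points chosen by the agents.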