Inverse reinforcement learning~(IRL) is a powerful framework to infer an agent's reward function by observing its behavior, but IRL algorithms that learn point estimates of the reward function can be misleading because there may be several functions that describe an agent's behavior equally well. A Bayesian approach to IRL models a distribution over candidate reward functions, alleviating the shortcomings of learning a point estimate. However, several Bayesian IRL algorithms use a $Q$-value function in place of the likelihood function. The resulting posterior is computationally intensive to calculate and has few theoretical guarantees, and the $Q$-value function is often a poor approximation of the likelihood. We introduce kernel density Bayesian IRL (KD-BIRL), which uses conditional kernel density estimation to directly approximate the likelihood, providing an efficient framework that, with a modified reward function parameterization, is applicable to environments with complex and infinite state spaces. We demonstrate KD-BIRL's benefits through a series of experiments in Gridworld environments and a simulated sepsis treatment task.
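To illustrate the conditional density estimation step, a generic Gaussian-kernel conditional KDE of a likelihood over state-action pairs given a reward, built from a training set $\{(s_i, a_i, r_i)\}_{i=1}^{n}$ with bandwidth $h$, could take the following form (a minimal sketch of standard conditional KDE, not necessarily the exact estimator used by KD-BIRL):
\begin{equation*}
\hat{p}(s, a \mid r) \;=\; \frac{\sum_{i=1}^{n} K_h(s - s_i)\, K_h(a - a_i)\, K_h(r - r_i)}{\sum_{i=1}^{n} K_h(r - r_i)},
\qquad
K_h(u) \;=\; \frac{1}{h}\,\phi\!\left(\frac{u}{h}\right),
\end{equation*}
where $\phi$ denotes the standard Gaussian density. Replacing the $Q$-value surrogate with such a direct likelihood estimate is the sense in which the method approximates the likelihood nonparametrically from demonstration data.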