It is well known that reinforcement learning can be cast as inference in an appropriate probabilistic model. However, this commonly involves introducing a distribution over agent trajectories with probabilities proportional to exponentiated rewards. In this work, we formulate reinforcement learning as Bayesian inference without resorting to rewards, and show that rewards are derived from the agent's preferences, rather than the other way around. We argue that agent preferences should be specified stochastically rather than deterministically. Reinforcement learning via inference with stochastic preferences naturally describes agent behaviors, does not require introducing rewards or exponential weighting of trajectories, and allows reasoning about agents on the solid foundation of Bayesian statistics. Stochastic conditioning, a probabilistic programming paradigm for conditioning models on distributions rather than values, is the formalism behind agents with stochastic preferences. We demonstrate our approach on case studies involving both a two-agent coordination game and a single agent acting in a noisy environment, showing that, despite superficial differences, both cases can be modeled and reasoned about according to the same principles.
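As a minimal sketch of the formalism named above, assuming the standard definition of stochastic conditioning from the probabilistic programming literature: whereas ordinary conditioning fixes an observation $y$ to a single value, conditioning on $y \sim D$ constrains $y$ to follow a distribution $D$, giving the posterior

\[
p(x \mid y \sim D) \;\propto\; p(x)\,\exp\!\Big(\mathbb{E}_{y \sim D}\big[\log p(y \mid x)\big]\Big).
\]

Under this reading, an agent's stochastic preferences enter the model as the distribution $D$ on which the trajectory model is conditioned, rather than as a reward whose exponential weights trajectories.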