Reinforcement learning (RL) has gained considerable attention by producing decision-making agents that maximise rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature: agents do not receive the true, complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies have applied RL to POMDPs by recalling previous decisions and observations, or by inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical in environments with high-dimensional continuous state and action spaces. Moreover, these so-called inference-based RL approaches require a large number of samples to perform well, since their agents ignore the uncertainty in the inferred state when making decisions. Active inference is a framework that is naturally formulated for POMDPs and directs agents to select decisions by minimising the expected free energy (EFE). This supplements the reward-maximising (exploitative) behaviour of RL with information-seeking (exploratory) behaviour. Despite this exploratory behaviour, the use of active inference has been limited to discrete state and action spaces because the EFE is computationally difficult to evaluate. We propose a unified principle for joint information-seeking and reward maximisation that clarifies the theoretical connection between active inference and RL, unifies the two frameworks, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
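The claim that EFE minimisation combines exploitation with exploration can be made concrete with the standard decomposition of the expected free energy. The sketch below uses the conventional notation of the active-inference literature (policy $\pi$, future hidden state $s_\tau$, observation $o_\tau$, biased generative model $\tilde{P}$ encoding preferences), which is not taken from this abstract:

```latex
\begin{aligned}
G(\pi) &= \mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\!\left[\ln Q(s_\tau \mid \pi) - \ln \tilde{P}(o_\tau, s_\tau \mid \pi)\right] \\
&\approx \underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[\ln \tilde{P}(o_\tau)\right]}_{\text{extrinsic value (reward-seeking)}}
\;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[ D_{\mathrm{KL}}\!\left[\,Q(s_\tau \mid o_\tau, \pi) \,\Vert\, Q(s_\tau \mid \pi)\,\right]\right]}_{\text{epistemic value (information-seeking)}}
\end{aligned}
```

Minimising $G(\pi)$ thus simultaneously drives the agent toward preferred (rewarding) observations via the extrinsic term and toward observations that are informative about the hidden state via the epistemic term; computing the epistemic expectation is what becomes intractable in high-dimensional continuous spaces.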