Reinforcement learning (RL) has gained considerable attention by producing decision-making agents that maximise rewards received from fully observable environments. However, many real-world problems are partially or noisily observable by nature: agents do not receive the true, complete state of the environment. Such problems are formulated as partially observable Markov decision processes (POMDPs). Some studies have applied RL to POMDPs by recalling previous decisions and observations, or by inferring the true state of the environment from received observations. Nevertheless, aggregating observations and decisions over time is impractical in environments with high-dimensional continuous state and action spaces. Moreover, these so-called inference-based RL approaches require a large number of samples to perform well, since their agents ignore the uncertainty in the inferred state when making decisions. Active inference is a framework that is naturally formulated for POMDPs and directs agents to select decisions by minimising the expected free energy (EFE). This supplements the reward-maximising (exploitative) behaviour of RL with information-seeking (exploratory) behaviour. Despite this exploratory behaviour, the use of active inference has been limited to discrete state and action spaces because the EFE is computationally difficult to evaluate. We propose a unified principle for joint information-seeking and reward maximisation that clarifies the theoretical connection between active inference and RL, unifies the two frameworks, and overcomes their aforementioned limitations. Our findings are supported by strong theoretical analysis. The proposed framework's superior exploration property is also validated by experimental results on partially observable tasks with high-dimensional continuous state and action spaces. Moreover, the results show that our model solves reward-free problems, making task reward design optional.
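The claim that EFE minimisation combines exploitation with exploration can be made concrete with the standard decomposition of the expected free energy. The sketch below uses the conventional notation of the active-inference literature (policy $\pi$, future hidden state $s_\tau$, observation $o_\tau$, biased generative model $\tilde{P}$ encoding preferences), which is not taken from this abstract:

```latex
\begin{aligned}
G(\pi) &= \mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\!\left[\ln Q(s_\tau \mid \pi) - \ln \tilde{P}(o_\tau, s_\tau \mid \pi)\right] \\
&\approx \underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[\ln \tilde{P}(o_\tau)\right]}_{\text{extrinsic value (reward-seeking)}}
\;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[ D_{\mathrm{KL}}\!\left[\,Q(s_\tau \mid o_\tau, \pi) \,\Vert\, Q(s_\tau \mid \pi)\,\right]\right]}_{\text{epistemic value (information-seeking)}}
\end{aligned}
```

Minimising $G(\pi)$ thus simultaneously drives the agent toward preferred (rewarding) observations via the extrinsic term and toward observations that are informative about the hidden state via the epistemic term; computing the epistemic expectation is what becomes intractable in high-dimensional continuous spaces.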