We study Markov decision processes (MDPs) in which agents have direct control over when and how they gather information, as formalized by action-contingent noiselessly observable MDPs (ACNO-MDPs). In these models, actions consist of two components: a control action that affects the environment, and a measurement action that affects what the agent can observe. To solve ACNO-MDPs, we introduce the act-then-measure (ATM) heuristic, which assumes that we can ignore future state uncertainty when choosing control actions. We show how following this heuristic may lead to shorter policy computation times, and we prove a bound on the performance loss it incurs. To decide whether or not to take a measurement action, we introduce the concept of measuring value. We develop a reinforcement learning algorithm based on the ATM heuristic, using a Dyna-Q variant adapted for partially observable domains, and showcase its superior performance compared to prior methods on a number of partially observable environments.
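To make the two-component action structure concrete, the following is a minimal sketch of an ACNO-MDP as a wrapper around a fully observable, gymnasium-style MDP. The class name ACNOWrapper, the fixed measurement_cost, and the convention of returning None when no measurement is taken are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an ACNO-MDP, assuming a gymnasium-style base environment.
# Each action is a pair (control action, measurement action): the control action
# drives the environment, the measurement action decides whether the successor
# state is observed noiselessly (at a cost) or not observed at all.
import gymnasium as gym


class ACNOWrapper(gym.Wrapper):
    """Illustrative ACNO-MDP wrapper (names and cost are assumptions)."""

    def __init__(self, env, measurement_cost=0.1):
        super().__init__(env)
        self.measurement_cost = measurement_cost
        # Action = (control action, measure flag in {0, 1}).
        self.action_space = gym.spaces.Tuple(
            (env.action_space, gym.spaces.Discrete(2))
        )

    def step(self, action):
        control, measure = action
        state, reward, terminated, truncated, info = self.env.step(control)
        if measure:
            # Noiseless observation of the successor state, paid for by a cost.
            return state, reward - self.measurement_cost, terminated, truncated, info
        # No measurement: the agent receives no observation of the new state.
        return None, reward, terminated, truncated, info
```

Under this sketch, an ATM-style agent would choose the control component as if future state uncertainty could be ignored, and decide the measure flag separately, e.g. by weighing the measurement cost against an estimate of the measuring value.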