We consider an extended notion of reinforcement learning in which the environment can simulate the agent and base its outputs on the agent's hypothetical behavior. Since good performance usually requires attending to whatever the environment's outputs depend on, we argue that for an agent to achieve on-average good performance across many such extended environments, it must self-reflect. Thus, weighted-average performance over the space of all suitably well-behaved extended environments can be viewed as a measure of how self-reflective an agent is. We give examples of extended environments and introduce a simple transformation which, in our experiments, appears to improve some standard RL agents' performance in a certain type of extended environment.
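To make the interaction pattern concrete, the following is a minimal Python sketch of one toy extended environment (an illustration only, not the framework used in our experiments; the names `RandomAgent` and `run_extended_episode` and the agent interface are hypothetical). The environment is handed the agent's class rather than just a stream of actions, so at each step it can instantiate a fresh copy of the agent, query that copy on a counterfactual observation, and base the reward on the answer:

```python
import random

class RandomAgent:
    """A trivial agent used only to exercise the loop below."""
    def act(self, obs):
        return random.choice([0, 1])

    def learn(self, obs, action, reward):
        pass  # a learning agent would update its policy here

def run_extended_episode(agent_class, num_steps=100, seed=0):
    """Run one episode in a toy *extended* environment.

    Unlike a standard RL environment, this one receives the agent's
    class itself, so it can simulate the agent's hypothetical behavior.
    """
    random.seed(seed)
    agent = agent_class()
    obs, total_reward = 0, 0.0
    for _ in range(num_steps):
        action = agent.act(obs)
        # Simulate the agent on an observation it never actually receives...
        simulated_copy = agent_class()
        hypothetical_action = simulated_copy.act(obs + 1)
        # ...and reward the real agent for differing from its own
        # hypothetical behavior. Scoring well requires the agent to
        # account for what it *would* do, i.e., to self-reflect.
        reward = 1.0 if action != hypothetical_action else -1.0
        agent.learn(obs, action, reward)
        total_reward += reward
        obs = random.randrange(10)
    return total_reward

print(run_extended_episode(RandomAgent))  # a random agent averages ~0 here
```

In this sketch the environment probes a fresh copy of the agent rather than the live agent, so that asking about hypothetical behavior does not perturb the agent's actual state.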