We consider an extended notion of reinforcement learning in which the environment can simulate the agent and base its outputs on the agent's hypothetical behavior. Since good performance usually requires paying attention to whatever the environment's outputs are based on, we argue that for an agent to achieve on-average good performance across many such extended environments, it must self-reflect. Thus, an agent's self-reflection ability can be numerically estimated by running the agent through a battery of extended environments. We are simultaneously releasing an open-source library of extended environments to serve as proof of concept for this technique. As the library is the first of its kind, we have avoided the difficult problem of optimizing it; instead, we have chosen environments with interesting properties: some seem paradoxical, some lead to interesting thought experiments, and some even suggest how self-reflection might have evolved in nature. We give examples and introduce a simple transformation that experimentally seems to increase self-reflection.
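To make the notion concrete, the following is a minimal Python sketch of what an extended environment might look like; the class MirrorEnv, the Agent type, and the PROBE_HISTORY parameter are illustrative assumptions and are not taken from the released library's actual API. The key difference from a standard RL environment is that step receives the agent itself and may query the agent's hypothetical behavior when computing reward.

```python
from typing import Callable, List, Tuple

# An agent is modeled as a function from an observation history to an action.
Agent = Callable[[List[int]], int]

class MirrorEnv:
    """Toy extended environment (hypothetical; not from the released library).
    At every step it simulates the agent on a fixed probe history and pays
    reward only when the agent's real action matches the action it would
    hypothetically take on that probe history."""

    PROBE_HISTORY: List[int] = [0, 1, 0, 1]

    def __init__(self) -> None:
        self.history: List[int] = []

    def step(self, agent: Agent, action: int) -> Tuple[int, float]:
        # The environment calls the agent itself -- something an ordinary
        # (non-extended) environment cannot do.
        hypothetical = agent(self.PROBE_HISTORY)
        reward = 1.0 if action == hypothetical else 0.0
        obs = len(self.history) % 2  # arbitrary observation signal
        self.history.append(obs)
        return obs, reward

if __name__ == "__main__":
    # An agent whose policy ignores the history trivially agrees with its own
    # hypothetical behavior and earns full reward; a history-dependent agent
    # that cannot model itself generally does not.
    constant_agent: Agent = lambda history: 0
    env = MirrorEnv()
    total = 0.0
    for _ in range(10):
        action = constant_agent(env.history)
        _, r = env.step(constant_agent, action)
        total += r
    print("total reward:", total)  # 10.0 for the constant agent
```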