Consider this scenario: an agent navigates a latent graph by performing actions that take it from one node to another. The chosen action determines the probability distribution over the next visited node. At each node, the agent receives an observation, but this observation is not unique to that node, so it does not identify it; the problem is therefore aliased. The purpose of this work is to provide a policy that approximately maximizes exploration efficiency (i.e., how well the graph is recovered for a given exploration budget). In the unaliased case, we show improved performance w.r.t. state-of-the-art reinforcement learning baselines. For the aliased case, we are not aware of suitable baselines; instead, we show recovery that is faster than a random policy for a wide variety of topologies, and exponentially faster for challenging topologies. We dub the algorithm eFeX (from eFficient eXploration).
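To make the setting concrete, here is a minimal sketch of the aliased latent-graph environment described above. This is illustrative only, not the paper's eFeX algorithm or implementation; the class and parameter names (`LatentGraphEnv`, `transitions`, `observations`) are assumptions introduced for this example. Each node maps every action to a probability distribution over successor nodes, and each node emits an observation symbol that may be shared with other nodes, so the observation alone does not identify the node.

```python
import random

class LatentGraphEnv:
    """Hypothetical aliased latent-graph environment (illustrative sketch)."""

    def __init__(self, transitions, observations, start, seed=0):
        # transitions[node][action] -> list of (next_node, probability)
        # observations[node] -> observation symbol, not necessarily unique
        self.transitions = transitions
        self.observations = observations
        self.node = start
        self.rng = random.Random(seed)

    def step(self, action):
        # The chosen action determines the distribution over the next node.
        successors, probs = zip(*self.transitions[self.node][action])
        self.node = self.rng.choices(successors, weights=probs, k=1)[0]
        # Aliased: the returned observation may not identify self.node.
        return self.observations[self.node]

# Nodes 1 and 2 both emit "B", so a single observation cannot
# distinguish them -- this is the aliasing the abstract refers to.
env = LatentGraphEnv(
    transitions={
        0: {0: [(1, 0.5), (2, 0.5)], 1: [(0, 1.0)]},
        1: {0: [(0, 1.0)], 1: [(2, 1.0)]},
        2: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    },
    observations={0: "A", 1: "B", 2: "B"},
    start=0,
)
obs = env.step(0)  # returns "B" whether the walk landed in node 1 or node 2
```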