In recent years, reinforcement learning (RL) has been applied with increasing success to science and to the process of scientific discovery in general. However, as RL algorithms learn to solve increasingly complex problems, interpreting the solutions they provide becomes ever more challenging. In this work, we gain insight into an RL agent's learned behavior through a post-hoc analysis based on sequence mining and clustering. Specifically, frequent and compact subroutines that the agent uses to solve a given task are distilled as gadgets and then grouped by various metrics. This process of gadget discovery proceeds in three stages: first, we use an RL agent to generate data; then, we employ a mining algorithm to extract gadgets; and finally, the obtained gadgets are grouped by a density-based clustering algorithm. We demonstrate our method by applying it to two quantum-inspired RL environments. First, we consider simulated quantum optics experiments for the design of high-dimensional multipartite entangled states, where the algorithm finds gadgets that correspond to modern interferometer setups. Second, we consider a circuit-based quantum computing environment, where the algorithm discovers various gadgets for quantum information processing, such as quantum teleportation. This approach to analyzing the policy of a learned agent is agent- and environment-agnostic and can yield interesting insights into any agent's policy.
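As a rough illustration of the three-stage pipeline described above, consider the following minimal sketch. The episode data, the action tokens, the contiguous n-gram miner, and the bag-of-actions features are all illustrative assumptions standing in for the paper's actual agent, mining algorithm, and gadget representation; only the overall structure (agent-generated sequences → frequent-subsequence mining → density-based clustering with DBSCAN) mirrors the method.

# Minimal sketch of the three-stage gadget-discovery pipeline.
# All data and featurization choices here are illustrative stand-ins.
from collections import Counter
import numpy as np
from sklearn.cluster import DBSCAN

# Stage 1 (assumed input): action sequences collected from a trained RL agent.
episodes = [
    ["BS", "PS", "BS", "M"],
    ["PS", "BS", "PS", "BS", "M"],
    ["BS", "BS", "M"],
]

# Stage 2: mine frequent, compact subroutines ("gadgets") as contiguous
# n-grams appearing in at least `min_support` distinct episodes.
def mine_gadgets(episodes, min_len=2, max_len=3, min_support=2):
    support = Counter()
    for ep in episodes:
        seen = set()  # count each gadget at most once per episode
        for n in range(min_len, max_len + 1):
            for i in range(len(ep) - n + 1):
                seen.add(tuple(ep[i:i + n]))
        support.update(seen)
    return [g for g, s in support.items() if s >= min_support]

gadgets = mine_gadgets(episodes)

# Stage 3: embed each gadget (here: bag-of-actions counts) and group the
# embeddings with a density-based clustering algorithm (DBSCAN).
actions = sorted({a for ep in episodes for a in ep})
X = np.array([[g.count(a) for a in actions] for g in gadgets])
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

for gadget, label in zip(gadgets, labels):
    print(label, gadget)

In this sketch, gadgets whose DBSCAN label is -1 are treated as noise rather than assigned to a cluster; in practice, the choice of gadget embedding and distance metric determines which subroutines are grouped together.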