Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works explore learning programmatic policies that are more interpretable and structured for generalization. Yet, these works either employ limited policy representations (e.g. decision trees, state machines, or predefined program templates) or require stronger supervision (e.g. input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme as well as analyze various methods for learning the program embedding.
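To make the two-stage scheme described above concrete, the sketch below illustrates only the second stage: searching a learned program embedding space for a latent vector whose decoded program maximizes task return, here with a simple cross-entropy method. It is a minimal illustration, not the authors' implementation; `decode_program` and `evaluate_return` are hypothetical stand-ins for a stage-1 decoder and the task environment, and all hyperparameters are placeholders.

```python
# Minimal sketch of latent-space program search (cross-entropy method).
# Assumes a stage-1 decoder mapping latent vectors to programs already exists.
import numpy as np

def decode_program(z):
    """Hypothetical stage-1 decoder: maps a latent vector to program source."""
    raise NotImplementedError

def evaluate_return(program, task):
    """Hypothetical rollout: executes the program on the task, returns a scalar."""
    raise NotImplementedError

def latent_search(task, dim=64, pop_size=64, elite_frac=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(elite_frac * pop_size))
    best_program, best_ret = None, -np.inf

    for _ in range(iters):
        # Sample candidate latent vectors from the current search distribution.
        candidates = mean + std * rng.standard_normal((pop_size, dim))
        returns = np.array(
            [evaluate_return(decode_program(z), task) for z in candidates]
        )
        # Refit the Gaussian to the highest-return (elite) candidates.
        elite = candidates[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

        if returns.max() > best_ret:
            best_ret = float(returns.max())
            best_program = decode_program(candidates[returns.argmax()])

    # The resulting policy is the decoded program itself, which remains
    # human-readable and can be executed without the neural decoder.
    return best_program, best_ret
```

The decoded program, rather than the latent vector or any neural network, is the final policy, which is what makes the synthesized behavior inspectable and amenable to generalization analysis.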