Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from expert advisors, in the form of either continuous monitoring or pre-defined rules, an arrangement that inevitably results in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, thereby enabling a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, namely the action requester and the adaptive state selector, which can be readily incorporated into various discrete actor-critic architectures. The former allows the agent to proactively seek advisor intervention in the presence of uncertain states, while the latter identifies unstable states potentially missed by the former, especially when the environment changes, and then learns to promote the ask action on such states. Experimental results in both stationary and non-stationary environments, and across different actor-critic backbones, demonstrate that the proposed framework significantly improves the learning efficiency of the agent and achieves performance on par with that obtained by continuous advisor monitoring.
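To make the learner-initiative mechanism concrete, the sketch below illustrates one minimal way an ask action can be folded into a discrete actor-critic policy. This is a hypothetical illustration, not the authors' implementation: the network shapes, the `advisor_policy` callable, and the decision to represent asking as one extra action index are all assumptions made for exposition.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AskActorCritic(nn.Module):
    """Discrete actor-critic whose action space is augmented with an 'ask' action.

    Illustrative sketch only: sampling the extra action index means the learner
    initiates a query to the advisor instead of acting on its own.
    """
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        # One extra logit represents the learner-initiated 'ask' action.
        self.actor = nn.Linear(hidden, n_actions + 1)
        self.critic = nn.Linear(hidden, 1)
        self.n_actions = n_actions

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return Categorical(logits=self.actor(h)), self.critic(h)

def act(agent: AskActorCritic, obs: torch.Tensor, advisor_policy):
    """Select an action; defer to the advisor only when the agent asks.

    `advisor_policy` is a placeholder for any function mapping an
    observation to an environment action.
    """
    dist, value = agent(obs)
    action = dist.sample()
    asked = action.item() == agent.n_actions  # extra index == 'ask'
    if asked:
        # Advisor intervenes on request, replacing the ask action
        # with an executable environment action.
        action = advisor_policy(obs)
    return action, value, asked
```

One appeal of folding the ask decision into the policy's own action space is that the asking behavior is then shaped by the same actor-critic updates as ordinary actions, which is also what would let a state-selection component encourage asking on states it flags as unstable.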