Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuous monitoring or pre-defined rules, which inevitably results in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, thereby enabling a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, namely the action requester and the adaptive state selector, which can be readily incorporated into various discrete actor-critic architectures. The former allows the agent to actively seek advisor intervention in the presence of uncertain states, while the latter identifies unstable states potentially missed by the former, especially when the environment changes, and learns to promote the ask action on such states. Experimental results on both stationary and non-stationary environments and across different actor-critic backbones demonstrate that the proposed framework significantly improves the learning efficiency of the agent, and achieves performance on par with that obtained by continuous advisor monitoring.
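To make the learner-initiative ask mechanism concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the class name `AskActorCritic`, the `ask_head`, and the `advisor` callable are illustrative assumptions, the action requester is reduced to a learned binary "ask" choice, and the adaptive state selector and all training logic are omitted.

```python
import torch
import torch.nn as nn

class AskActorCritic(nn.Module):
    """Hypothetical sketch of the Ask-AC decision loop (illustrative only).

    The agent carries an extra 'ask' head alongside the usual actor-critic
    heads; the adaptive state selector and training updates are omitted.
    """

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # task-action policy
        self.value_head = nn.Linear(hidden, 1)           # state-value critic
        self.ask_head = nn.Linear(hidden, 2)             # requester: {act alone, ask advisor}

    def act(self, state: torch.Tensor, advisor=None):
        h = self.body(state)
        # Action requester: a learned binary choice over whether to query
        # the advisor on the current (possibly uncertain) state.
        ask_probs = torch.softmax(self.ask_head(h), dim=-1)
        if advisor is not None and torch.multinomial(ask_probs, 1).item() == 1:
            return advisor(state), True   # advisor intervenes on this state
        action_probs = torch.softmax(self.policy_head(h), dim=-1)
        return torch.multinomial(action_probs, 1).item(), False
```

In this sketch the ask decision is sampled from a dedicated head so that, with an appropriate training signal, the agent can learn on which states advisor intervention pays off, rather than querying continuously.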