The recent success of reinforcement learning (RL) in solving complex tasks is most often attributed to its capacity to explore and exploit the environment in which it is trained. Sample efficiency is usually not an issue, since cheap simulators are available to sample data on-policy. Task-oriented dialogue policies, on the other hand, are usually learned from offline data collected through human demonstrations, and collecting diverse demonstrations and annotating them is expensive. Unfortunately, RL methods trained on off-policy data are prone to issues of bias and generalization, which are further exacerbated by the stochasticity of human responses and the non-Markovian belief state of a dialogue management system. To this end, we propose a batch RL framework for task-oriented dialogue policy learning: Causal Aware Safe Policy Improvement (CASPI). This method provides guarantees on the dialogue policy's performance and also learns to shape rewards according to the intentions behind human responses rather than just mimicking demonstration data; this, coupled with batch RL, improves the overall sample efficiency of the framework. We demonstrate the effectiveness of this framework on the dialogue-context-to-text generation and end-to-end dialogue tasks of the MultiWOZ2.0 dataset. The proposed method outperforms the current state of the art on both tasks. In the end-to-end case, our method trained on only 10\% of the data outperforms the current state of the art on three out of four evaluation metrics.