The natural language generation (NLG) module in task-oriented dialogue systems translates structured meaning representations (MRs) into text responses and, as the human-machine interaction interface, has a great impact on the user experience. In practice, however, developers often have only a few well-annotated examples and face a high data collection cost when building the NLG module. In this work, we adopt the self-training framework to address the few-shot MR-to-Text generation problem. We leverage a pre-trained language model to self-augment a large amount of pseudo-labeled data. To prevent a gradual drift from the target data distribution toward the noisy augmented data distribution, we propose a novel data selection strategy that selects the examples our generation model is most uncertain about. Compared with existing data selection methods, our method is: (1) parameter-efficient, as it does not require training any additional neural models; (2) computation-efficient, as it only needs several stochastic forward passes of the model to estimate the uncertainty. We conduct empirical experiments on two benchmark datasets, FewShotWOZ and FewShotSGD, and show that our proposed framework consistently outperforms other baselines in terms of BLEU and ERR.
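To make the uncertainty-based selection step concrete, the sketch below illustrates one common way to estimate uncertainty from stochastic forward passes (Monte Carlo dropout): keep dropout active, score each pseudo-labeled (MR, text) pair several times, and rank pairs by the variance of their sequence log-likelihoods. This is a minimal illustration, not the authors' code; the seq2seq-style model interface and the names `model`, `tokenizer`, `pseudo_pairs`, `num_passes`, and `k` are assumptions made for the example.

```python
# Minimal sketch of uncertainty-based data selection via stochastic forward
# passes (MC dropout). Assumes a Hugging Face-style seq2seq LM whose forward
# call returns a mean cross-entropy `loss` when `labels` are provided.
import torch

def sequence_log_likelihood(model, input_ids, labels):
    """Negative mean cross-entropy = average token log-likelihood of `labels`."""
    with torch.no_grad():
        out = model(input_ids=input_ids, labels=labels)
    return -out.loss.item()

def select_most_uncertain(model, tokenizer, pseudo_pairs, num_passes=5, k=100):
    """Rank pseudo-labeled (MR, text) pairs by the variance of their scores
    across stochastic forward passes and return the k most uncertain pairs."""
    model.train()  # keep dropout layers active so each pass is stochastic
    scored = []
    for mr, text in pseudo_pairs:
        input_ids = tokenizer(mr, return_tensors="pt").input_ids
        labels = tokenizer(text, return_tensors="pt").input_ids
        lls = [sequence_log_likelihood(model, input_ids, labels)
               for _ in range(num_passes)]
        uncertainty = torch.tensor(lls).var().item()
        scored.append((uncertainty, mr, text))
    scored.sort(key=lambda x: x[0], reverse=True)  # most uncertain first
    return [(mr, text) for _, mr, text in scored[:k]]
```

The selected pairs would then be mixed with the few-shot annotated data for the next self-training round; whether the paper uses exactly this variance-based score is an assumption of this sketch.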