Generalization is a central problem in machine learning, especially when training data is limited. Using prior information to enforce constraints is a principled way of encouraging generalization. In this work, we propose to leverage the prior information embedded in pretrained language models (LMs) to improve generalization on intent classification and slot labeling tasks with limited training data. Specifically, we extract prior knowledge from a pretrained LM in the form of synthetic data, which encodes the prior implicitly. We fine-tune the LM to generate an augmented language that contains not only the text but also encodes both intent and slot labels. The generated synthetic data can then be used to train a downstream classifier. Since the generated data may contain noise, we reframe learning from generated data as learning with noisy labels. We then apply mixout regularization to the classifier and demonstrate its effectiveness in resisting label noise in the generated data. Empirically, our method outperforms the baseline by a large margin.
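As a minimal sketch of the mixout idea referenced above (Lee et al.'s mixout regularizer), the snippet below shows the core operation: during fine-tuning, each weight coordinate is stochastically swapped with its pretrained counterpart and the result is rescaled so its expectation matches the current weight. The function name and the point at which it is applied are illustrative assumptions, not the paper's exact implementation; in our setting it would regularize the classifier trained on the generated (noisy) synthetic data.

```python
import torch


def mixout(weight: torch.Tensor, pretrained: torch.Tensor, p: float) -> torch.Tensor:
    """Mix current weights with pretrained weights (mixout regularization).

    Each coordinate of `weight` is replaced by the corresponding coordinate of
    `pretrained` with probability `p`, then the result is rescaled so that its
    expectation equals `weight` (analogous to inverted dropout).
    """
    if p <= 0.0:
        return weight
    mask = torch.bernoulli(torch.full_like(weight, p))
    mixed = mask * pretrained + (1.0 - mask) * weight
    return (mixed - p * pretrained) / (1.0 - p)
```

During training the mixed weights are used in the forward pass; at evaluation time the plain fine-tuned weights are used, so mixout acts purely as a regularizer that keeps the classifier close to its pretrained initialization.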