Despite the success of mixup in data augmentation, its applicability to natural language processing (NLP) tasks has been limited due to the discrete and variable-length nature of natural languages. Recent studies have thus relied on domain-specific heuristics and manually crafted resources, such as dictionaries, to apply mixup in NLP. In this paper, we instead propose an unsupervised learning approach to text interpolation for data augmentation, which we call "Learning to INterpolate for Data Augmentation" (LINDA). LINDA requires neither heuristics nor manually crafted resources but learns to interpolate between any pair of natural language sentences over a natural language manifold. After empirically demonstrating LINDA's interpolation capability, we show that it allows us to seamlessly apply mixup in NLP and leads to better generalization in text classification, both in-domain and out-of-domain.
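For context, the sketch below shows the standard mixup formulation (Zhang et al., 2018) on continuous inputs: a convex combination of two examples and their labels with a mixing coefficient drawn from a Beta distribution. It is exactly this reliance on arithmetic over raw inputs that breaks down for discrete, variable-length text, which is the gap LINDA addresses by learning the interpolation instead. The function and toy data here are illustrative, not part of the paper's method.

```python
import numpy as np

def mixup(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Vanilla mixup: convex combination of two examples and their
    labels, with lam ~ Beta(alpha, alpha). Only valid for inputs that
    support arithmetic (e.g., images or embeddings), not raw text."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2   # interpolated input
    y_mix = lam * y1 + (1.0 - lam) * y2   # interpolated (soft) label
    return x_mix, y_mix

# Toy usage: two feature vectors with one-hot labels.
x1, x2 = np.array([1.0, 0.0, 2.0]), np.array([0.0, 3.0, 1.0])
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, x2, y1, y2)
```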