Natural Language Processing (NLP) relies heavily on training data. As Transformer models have grown larger, they have required increasingly large amounts of training data. To satisfy this requirement, text augmentation can be used to expand an existing dataset and to improve model generalization. One such augmentation is translation augmentation, or back-translation: an English sentence is translated into another language and then translated back into English. In this paper, we examine the effect of back-translation through 108 different languages on various metrics and text embeddings.
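The round trip described above (English to a pivot language and back) can be implemented with off-the-shelf translation models. The sketch below is a minimal illustration, assuming the Hugging Face `transformers` library and the Helsinki-NLP MarianMT checkpoints; the model names, the `back_translate` helper, and the French pivot are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal back-translation sketch (assumed setup: transformers + MarianMT checkpoints).
from transformers import MarianMTModel, MarianTokenizer


def back_translate(sentence: str, pivot: str = "fr") -> str:
    """Translate an English sentence to a pivot language and back to English."""
    en_to_pivot = f"Helsinki-NLP/opus-mt-en-{pivot}"  # assumed checkpoint name
    pivot_to_en = f"Helsinki-NLP/opus-mt-{pivot}-en"  # assumed checkpoint name

    # English -> pivot language
    tok_fwd = MarianTokenizer.from_pretrained(en_to_pivot)
    model_fwd = MarianMTModel.from_pretrained(en_to_pivot)
    pivot_ids = model_fwd.generate(**tok_fwd(sentence, return_tensors="pt"))
    pivot_text = tok_fwd.batch_decode(pivot_ids, skip_special_tokens=True)[0]

    # pivot language -> English
    tok_bwd = MarianTokenizer.from_pretrained(pivot_to_en)
    model_bwd = MarianMTModel.from_pretrained(pivot_to_en)
    back_ids = model_bwd.generate(**tok_bwd(pivot_text, return_tensors="pt"))
    return tok_bwd.batch_decode(back_ids, skip_special_tokens=True)[0]


if __name__ == "__main__":
    # The back-translated sentence is a paraphrase that can augment the dataset.
    print(back_translate("NLP relies heavily on training data."))
```

Repeating this procedure with a different pivot language for each available translation pair is one way to generate the per-language augmented variants studied here.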