While recent benchmarks have spurred a great deal of new work on improving the generalization of pretrained multilingual language models across multilingual tasks, techniques for improving code-switched natural language understanding have been far less explored. In this work, we propose bilingual intermediate pretraining as a reliable technique for deriving large and consistent performance gains on three different NLP tasks using code-switched text. We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99% in mean accuracy and F1 score over previous state-of-the-art systems on Hindi-English Natural Language Inference (NLI), Question Answering (QA), and Spanish-English Sentiment Analysis (SA), respectively. We further show consistent performance gains for SA on four different code-switched language pairs (Hindi-English, Spanish-English, Tamil-English, and Malayalam-English). We also present a code-switched masked language modelling (MLM) pretraining technique that consistently benefits SA compared to standard MLM pretraining using real code-switched text.
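To make the MLM-based intermediate pretraining step concrete, below is a minimal sketch of masked language modelling on code-switched text using the Hugging Face Transformers and Datasets libraries. The backbone model, the toy Hindi-English sentences, and all hyperparameters are illustrative assumptions, not the paper's exact setup; the resulting checkpoint would subsequently be fine-tuned on the downstream NLI, QA, or SA task.

```python
# Minimal sketch (assumptions noted above): intermediate MLM pretraining on
# code-switched text before task fine-tuning.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Toy Hindi-English code-switched sentences; in practice this would be a
# large corpus of real or synthetically generated code-switched text.
corpus = [
    "mujhe yeh movie bahut pasand aayi, totally worth watching",
    "kal office mein meeting hai, please be on time",
]
dataset = Dataset.from_dict({"text": corpus})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cs-mlm-intermediate",  # checkpoint to be fine-tuned on NLI/QA/SA
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```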