Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet the use of multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite being common in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.