In natural language processing (NLP), handling code-mixing (CM) is challenging, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method for constructing a Hokkien-Mandarin CM dataset that mitigates this limitation, addresses the morphological issue within the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use the proposed dataset and transfer learning to train XLM (a cross-lingual language model) for translation tasks, adapting XLM slightly to fit the code-mixing scenario. We find that, by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.