The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research has focused on improving computational methods and has largely ignored the linguistic and social aspects of code-switching (C-S) that the long-established linguistics literature discusses across a wide range of languages. To fill this gap, we offer a survey of C-S covering the literature in linguistics, with a reflection on the key issues in language technologies. From the linguistic perspective, we provide an overview of structural and functional patterns of C-S, focusing on the literature from European and Indian contexts as highly multilingual areas. From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to a lack of appropriate training data, a lack of robust evaluation benchmarks for C-S (across multilingual situations and types of C-S), and a lack of end-to-end systems that also cover sociolinguistic aspects of C-S. We intend this survey as a step towards an outcome of mutual benefit for computational scientists and linguists with a shared interest in multilingualism and C-S.