CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts

The task of automatically identifying the language used in a given text is called Language Identification (LI). India is a multilingual country, and many Indians, especially young people, are comfortable with Hindi and English in addition to their local languages. Hence, they often use more than one language when posting comments on social media. Texts containing more than one language are called "code-mixed texts" and are a good source of input for LI. Languages in these texts may be mixed at the sentence level, the word level, or even the sub-word level. LI at the word level is a sequence labeling problem in which every word in a sentence is tagged with one language from a predefined set. To address word-level LI in code-mixed Kannada-English (Kn-En) texts, this work presents i) the construction of a code-mixed Kn-En dataset called the CoLI-Kenglish dataset, ii) a code-mixed Kn-En embedding, and iii) learning models using Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches. Code-mixed Kn-En texts are extracted from Kannada YouTube video comments to construct the CoLI-Kenglish dataset and the code-mixed Kn-En embedding. The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location", and "Other". The learning models, namely CoLI-vectors and CoLI-ngrams based on ML, CoLI-BiLSTM based on DL, and CoLI-ULMFiT based on TL, are built and evaluated using the CoLI-Kenglish dataset. The CoLI-ngrams model outperformed the other models, with a macro average F1-score of 0.64. However, the results of all the learning models were quite competitive with one another.
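
To make the word-level framing and the reported metric concrete, the following is a minimal sketch, not the authors' implementation. It shows one plausible way to turn a word into character n-gram counts (the abstract does not specify the features behind CoLI-ngrams, so the n-gram range and boundary markers are assumptions) and how a macro average F1-score is computed over word-level tags. The function names, the example tokens, and the gold and predicted tags are all hypothetical.

```python
from collections import Counter

# The six word-level categories named in the abstract.
LABELS = ["Kannada", "English", "Mixed-language", "Name", "Location", "Other"]


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of a word (assumed feature design).

    The word is lowercased and wrapped in '<' and '>' so that prefixes
    and suffixes produce distinct n-grams.
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def macro_f1(gold, predicted, labels=None):
    """Unweighted mean of per-label F1 over word-level tags.

    If no label list is given, only labels that occur in the gold or
    predicted tags are averaged.
    """
    if labels is None:
        labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical code-mixed comment with invented gold and predicted tags.
tokens = ["tumba", "chennagide", "super", "movie"]
gold = ["Kannada", "Kannada", "English", "English"]
predicted = ["Kannada", "English", "English", "English"]

features = [char_ngrams(token) for token in tokens]
score = macro_f1(gold, predicted)
print(f"macro F1 on the toy example: {score:.3f}")
```

Because macro averaging weights every category equally, rare categories such as "Location" or "Mixed-language" influence the score as much as frequent ones. Passing `labels=LABELS` to `macro_f1` averages over all six categories, and any category that never appears in the gold or predicted tags then contributes an F1 of 0.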
