Code-mixed data is currently receiving considerable attention in natural language processing (NLP) research. Language identification in code-mixed social media text has become an active research problem in recent years, driven by the growing influence of social media on communication. This paper describes the system submitted by the Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC) team to the CoLI-Kanglish shared task at ICON2022. We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English text. On the CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
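The two reported metrics aggregate per-class F1 differently: macro F1 averages the per-class scores with equal weight, while weighted F1 weights each class by its number of gold tokens. The sketch below illustrates that difference on a toy example. The tag names (`kn`, `en`, `mixed`) and the token counts are hypothetical and are not taken from the CoLI-Kenglish dataset, and the functions are a plain-Python illustration rather than the shared task's official scorer.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} over all labels in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[gold] += 1   # gold label gets a false negative
    support = Counter(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of gold tokens per class."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


# Hypothetical, imbalanced word-level tags (illustration only).
y_true = ["kn"] * 6 + ["en"] * 3 + ["mixed"]
# The rare "mixed" token is mislabeled as "kn"; everything else is correct.
y_pred = ["kn"] * 6 + ["en"] * 3 + ["kn"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.2f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.2f}")
```

In this toy case a single error on a rare class lowers macro F1 far more than weighted F1. A gap of this kind between the two scores is commonly a sign of weaker performance on infrequent classes, although the abstract itself does not report per-class results.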