This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at the VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models: a Naive Bayes classifier with adaptive language models, which has been shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model, an architecture widely regarded as state-of-the-art for a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organizers, whereas the second submission is considered open as it used a model pretrained on external data. Our team attained shared second place in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification tasks as they are in many other text classification tasks.