Obtaining extensive annotated data for under-resourced languages is challenging, so in this work we investigate whether it is beneficial to train models using multi-task learning. Sentiment analysis and offensive language identification share similar discourse properties. The selection of these tasks is motivated by the scarcity of large labelled datasets of user-generated code-mixed text. This paper focuses on code-mixed YouTube comments in Tamil, Malayalam, and Kannada. Our framework is applicable to other sequence classification problems regardless of dataset size. Experiments show that our multi-task learning models achieve results comparable to or better than single-task learning while reducing the time and memory required to train separate models for each task. Analysis of the fine-tuned models shows that multi-task learning outperforms single-task learning, yielding a higher weighted F1-score on all three languages. We apply two multi-task learning approaches to three Dravidian languages: Kannada, Malayalam, and Tamil. The best scores on Kannada and Malayalam were achieved by mBERT trained with cross-entropy loss and hard parameter sharing. The best score on Tamil was achieved by DistilBERT trained with cross-entropy loss and soft parameter sharing. The best-performing models achieved weighted F1-scores of (66.8\% and 90.5\%) for Kannada, (59\% and 70\%) for Malayalam, and (62.1\% and 75.3\%) for Tamil on sentiment analysis and offensive language identification, respectively. The data and approaches discussed in this paper are published on GitHub\footnote{\href{https://github.com/SiddhanthHegde/Dravidian-MTL-Benchmarking}{Dravidian-MTL-Benchmarking}}.
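As a minimal sketch of the hard-parameter-sharing setup named above, the following Python snippet (PyTorch with Hugging Face \texttt{transformers}) shows a shared mBERT encoder with one classification head per task, trained jointly with a summed cross-entropy loss. The class name, head names, and label counts are illustrative assumptions, not taken from the released code.
\begin{verbatim}
import torch.nn as nn
from transformers import AutoModel

class HardSharingMTL(nn.Module):
    """Shared encoder (hard parameter sharing) with two task heads."""
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 num_sentiment_labels=5, num_offensive_labels=6):
        super().__init__()
        # Encoder weights are shared by both tasks (hard sharing).
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One lightweight, task-specific classification head per task.
        self.sentiment_head = nn.Linear(hidden, num_sentiment_labels)
        self.offensive_head = nn.Linear(hidden, num_offensive_labels)

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token representation for sequence classification.
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]
        return self.sentiment_head(cls), self.offensive_head(cls)

def mtl_loss(sent_logits, off_logits, sent_labels, off_labels):
    # Joint objective: sum of the per-task cross-entropy losses.
    ce = nn.CrossEntropyLoss()
    return ce(sent_logits, sent_labels) + ce(off_logits, off_labels)
\end{verbatim}
In soft parameter sharing, by contrast, each task keeps its own encoder and the encoders are encouraged to stay close (e.g. via a regularisation term on their weights) rather than being identical.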