Social media often acts as a breeding ground for different forms of offensive content. For low-resource languages like Tamil, the situation is more complex due to the poor performance of multilingual or language-specific models and the lack of proper benchmark datasets. Based on the shared task Offensive Language Identification in Dravidian Languages at EACL 2021, we present an exhaustive exploration of different transformer models. We also provide a genetic algorithm technique for ensembling different models. Our ensembled models, trained separately for each language, secured first position in the Tamil, second position in the Kannada, and first position in the Malayalam sub-tasks. The models and code are provided.
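The abstract does not spell out how the genetic algorithm combines the models, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes each model outputs per-class probabilities on a validation set, that the ensemble is a weighted average of those probabilities, and that fitness is plain accuracy. The function names (`ensemble_predict`, `accuracy`, `evolve_weights`) and all hyperparameters are hypothetical.

```python
import random


def ensemble_predict(weights, model_probs):
    """Weighted sum of per-model class probabilities, then argmax per example.

    model_probs[m][i][c] is model m's probability of class c for example i.
    """
    n_examples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    preds = []
    for i in range(n_examples):
        scores = [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=scores.__getitem__))
    return preds


def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def evolve_weights(model_probs, labels, pop_size=20, generations=30,
                   mutation_scale=0.1, seed=0):
    """Search for ensemble weights with a simple genetic algorithm.

    Assumed design (not taken from the paper): truncation selection that
    keeps the top half unchanged, uniform crossover, Gaussian mutation.
    """
    rng = random.Random(seed)
    n_models = len(model_probs)

    def fitness(weights):
        return accuracy(ensemble_predict(weights, model_probs), labels)

    # Each individual is one non-negative weight per model.
    population = [[rng.random() for _ in range(n_models)]
                  for _ in range(pop_size)]

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # survivors carry over as-is
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: take each weight from one parent at random.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clipped so weights stay non-negative.
            child = [max(0.0, w + rng.gauss(0.0, mutation_scale))
                     for w in child]
            children.append(child)
        population = parents + children

    best = max(population, key=fitness)
    total = sum(best) or 1.0  # guard against an all-zero individual
    return [w / total for w in best]
```

Under these assumptions, one would call `evolve_weights` with the validation-set probabilities of the fine-tuned transformers for one language, then reuse the returned weights in `ensemble_predict` on the test set. A class-weighted metric could replace `accuracy` as the fitness function if the label distribution is skewed.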