Transformers are among the most prominent architectures used for a wide range of Natural Language Processing tasks. These models are pre-trained on large text corpora and provide state-of-the-art results on tasks like text classification. In this work, we conduct a comparative study between monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on Marathi datasets for hate speech detection, sentiment analysis, and simple text classification. We use standard multilingual models such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi. We further show that the Marathi monolingual models outperform the multilingual BERT variants in five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not generalize well to out-of-domain social media datasets. We consider two Marathi hate speech datasets, L3Cube-MahaHate and HASOC-2021, a Marathi sentiment classification dataset, L3Cube-MahaSent, and the Marathi Headline and Articles classification datasets.
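To illustrate the frozen-encoder evaluation described above, here is a minimal sketch of extracting sentence embeddings from a Marathi BERT model with mean pooling over token representations. It is not the authors' exact pipeline; the checkpoint identifier "l3cube-pune/marathi-bert" and the example sentence are assumptions for illustration.

```python
# Sketch: frozen BERT encoder as a sentence-embedding extractor (mean pooling).
# Assumes the Hugging Face transformers library; the model name is an assumption.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "l3cube-pune/marathi-bert"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # encoder layers are frozen: used for inference only, no fine-tuning

sentences = ["उदाहरण वाक्य"]  # hypothetical Marathi input sentence

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch)
    # Mean-pool the token embeddings, masking out padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # (num_sentences, hidden_size)
```

Under this setup, the pooled vectors can be fed to a simple classifier (e.g., logistic regression) to compare representation quality across monolingual and multilingual models without updating the encoder weights.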