Sentiment analysis is one of the most fundamental tasks in Natural Language Processing. Widely used languages such as English, Arabic, Russian, and Mandarin, as well as Indian languages such as Hindi, Bengali, and Tamil, have seen a significant amount of work in this area. However, Marathi, which is the third most popular language in India, still lags behind due to the absence of proper datasets. In this paper, we present the first major publicly available Marathi sentiment analysis dataset, L3CubeMahaSent. It is curated from tweets extracted from the Twitter accounts of various Maharashtrian personalities. Our dataset consists of ~16,000 distinct tweets classified into three broad classes: positive, negative, and neutral. We also present the guidelines we followed to annotate the tweets. Finally, we present the statistics of our dataset and baseline classification results using CNN, LSTM, ULMFiT, and BERT-based deep learning models.