Sentiment Analysis (SA) is a major field of study in natural language processing, computational linguistics, and information retrieval. Interest in SA has grown steadily in both academia and industry in recent years. There is also an increasing need for appropriate resources and datasets, particularly for low-resource languages such as Persian. These datasets play an important role in designing and developing opinion mining platforms using supervised, semi-supervised, or unsupervised methods. In this paper, we outline the entire process of developing a manually annotated sentiment corpus, SentiPers, which covers formal and informal written contemporary Persian. To the best of our knowledge, SentiPers is unique among Persian sentiment corpora in offering rich annotation at three levels: document, sentence, and entity/aspect. The corpus contains more than 26,000 sentences of users' opinions from the digital product domain and offers distinctive features, such as quantifying how positive or negative an opinion is by assigning each sentence a number within a specific range. Furthermore, we present statistics on various components of the corpus and study the inter-annotator agreement. Finally, we discuss some of the challenges we faced during the annotation process.
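The abstract does not say which agreement statistic is reported. As an illustration only, the sketch below computes Cohen's kappa, one common choice for two annotators, over sentence-level polarity scores. The function name `cohen_kappa`, the integer score range from -2 to 2, and the sample labels are all assumptions made for this example and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; not the paper's stated method.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each labelled at random with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up polarity scores on an assumed -2..2 scale (not corpus data).
annotator_a = [2, 1, 0, -1, -2, 0]
annotator_b = [2, 1, 0, -1, -1, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Kappa treats the scores as unordered categories, so a disagreement between -2 and -1 counts the same as one between -2 and 2. For graded polarity scores, a weighted variant may be more appropriate.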