Unsupervised sentiment analysis is traditionally performed by counting the words in a text that appear in a sentiment lexicon and then assigning a label based on the proportion of positive and negative words found. While these "counting" methods are considered beneficial because they rate a text deterministically, their classification rates decrease when the analyzed texts are short or when the vocabulary differs from what the lexicon treats as the default. The model proposed in this paper, called Lex2Sent, is an unsupervised sentiment analysis method that aims to improve on the classification performance of sentiment lexicon methods. For this purpose, a Doc2Vec model is trained to determine the distances between document embeddings and the embeddings of the positive and negative parts of a sentiment lexicon. These distances are evaluated over multiple executions of Doc2Vec on resampled documents and then averaged to perform the classification task. On the three benchmark datasets considered in this paper, the proposed Lex2Sent outperforms every evaluated lexicon, including state-of-the-art lexica such as VADER and the Opinion Lexicon, in terms of classification rate.
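To make the two ideas in the abstract concrete, the following minimal Python sketch contrasts a lexicon "counting" baseline with a distance-based decision averaged over several embedding runs. It is an illustration under stated assumptions, not the Lex2Sent implementation: the toy word lists, the function names `count_label` and `averaged_label`, the use of cosine distance, and the hand-written vectors standing in for Doc2Vec output are all hypothetical.

```python
import math
from statistics import mean

# Toy lexicon halves (hypothetical; a real lexicon such as VADER is far larger).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def count_label(tokens):
    """Counting baseline: compare the number of positive and negative hits."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == neg:
        # Typical failure on short texts: no hits, or a tie.
        return "undecided"
    return "positive" if pos > neg else "negative"


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def averaged_label(runs):
    """Average a distance difference over several embedding runs.

    `runs` holds one (doc_vec, pos_vec, neg_vec) triple per run, where
    pos_vec and neg_vec embed the positive and negative lexicon halves.
    In the paper these would come from Doc2Vec trained on resampled
    documents; here they are assumed to be precomputed.
    """
    diffs = [
        cosine_distance(doc, neg) - cosine_distance(doc, pos)
        for doc, pos, neg in runs
    ]
    # A positive mean says the document sits closer to the positive half.
    # A mean of exactly zero falls to "negative" in this sketch.
    return "positive" if mean(diffs) > 0 else "negative"


# Hand-written stand-ins for two embedding runs of the same document.
runs = [
    ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),
    ([0.9, 0.2], [1.0, 0.0], [-1.0, 0.5]),
]
print(count_label("the screen arrived today".split()))
print(averaged_label(runs))
```

The counting baseline has nothing to go on when a short text contains no lexicon words, whereas the distance-based rule always yields a label, and averaging over runs dampens the randomness of any single embedding.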