Sentiment classification is a fundamental task in Natural Language Processing with important academic and commercial applications. It aims to automatically predict the degree of sentiment in text that contains opinions and some level of subjectivity, such as product reviews, movie reviews, or tweets. This is difficult, in part because different text domains contain different words and expressions. The difficulty increases when the text is written in a language other than English, owing to the lack of databases and resources. As a consequence, cross-domain and cross-language techniques are often applied to this task to improve results. In this work, we study the ability of a classification system trained on a large database of product reviews to generalize to different Spanish domains. Reviews were collected from the MercadoLibre website in seven Latin American countries, allowing the creation of a large and balanced dataset. Results suggest that generalization across domains is feasible, though very challenging, when the system is trained on these product reviews, and that it can be improved by pre-training and fine-tuning the classification model.