Most existing sentiment analysis approaches rely heavily on large amounts of labeled data, which usually require time-consuming and error-prone manual annotation. Labeled data are also distributed very unevenly across languages: more English texts are labeled than texts in other languages, which presents a major challenge to cross-lingual sentiment analysis. Several cross-lingual representation learning techniques transfer the knowledge learned from a language with abundant labeled examples to another language with far fewer labels. Their performance, however, is usually limited by the imperfect quality of machine translation and by the scarcity of signals that bridge the two languages. In this paper, we employ emojis, a ubiquitous and emotional language, as a new bridge for sentiment analysis across languages. Specifically, we propose a semi-supervised representation learning approach that uses the task of emoji prediction to learn cross-lingual representations of text capturing both semantic and sentiment information. The learned representations are then used to facilitate cross-lingual sentiment classification. We demonstrate the effectiveness and efficiency of our approach on a representative Amazon review data set that covers three languages and three domains.
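To make the emoji prediction task concrete, the following is a minimal sketch of how unlabeled posts could be turned into (text, emoji) training pairs. It is an illustration under stated assumptions, not the procedure used in this paper: the function names `to_emoji_example` and `build_dataset` are hypothetical, and the Unicode ranges in `EMOJI_RE` are a rough, non-exhaustive approximation of what counts as an emoji.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a rough, non-exhaustive set of emoji code-point ranges.
# Variation selectors, ZWJ sequences, and flags are not handled here.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def to_emoji_example(post: str) -> Tuple[str, List[str]]:
    """Split a post into its emoji-free text and the emojis it contained."""
    labels = EMOJI_RE.findall(post)
    # Remove the emojis from the input so a model cannot simply copy them.
    text = " ".join(EMOJI_RE.sub(" ", post).split())
    return text, labels


def build_dataset(posts: Iterable[str]) -> List[Tuple[str, str]]:
    """Build (text, emoji) pairs; posts without text or without emojis are skipped."""
    examples = []
    for post in posts:
        text, labels = to_emoji_example(post)
        if not text or not labels:
            continue
        # One example per distinct emoji, keeping first-seen order.
        for emoji in dict.fromkeys(labels):
            examples.append((text, emoji))
    return examples
```

Because the labels come from the posts themselves, pairs of this kind need no manual annotation and can be built the same way for each language. A model trained to predict the emoji from the text would then supply the representations that the sentiment classifier uses.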