The rapid production of data on the internet, together with the need to understand how users feel from both business and research perspectives, has prompted the creation of numerous automatic monolingual sentiment detection systems. More recently, however, due to the unstructured nature of data on social media, we are observing more instances of multilingual and code-mixed text. This shift in content type has created a new demand for code-mixed sentiment analysis systems. In this study, we collect and label tweets to create a dataset of Persian-English code-mixed tweets. We then introduce a model that uses pretrained BERT embeddings as well as translation models to automatically learn the polarity scores of these tweets. Our model outperforms baseline models that use Naïve Bayes and Random Forest methods.