In the online world, Machine Translation (MT) systems are extensively used to translate User-Generated Text (UGT) such as reviews, tweets, and social media posts, where the main message is often the author's positive or negative attitude towards the topic of the text. However, MT systems still lack accuracy for some low-resource languages and sometimes make critical translation errors that completely flip the sentiment polarity of the target word or phrase and hence deliver the wrong affect message. This is particularly noticeable in texts that do not follow common lexico-grammatical standards, such as the dialectal Arabic (DA) used on online platforms. In this research, we aim to improve the translation of sentiment in UGT written in dialectal Arabic into English. Given the scarcity of gold-standard parallel data for DA-EN in the UGT domain, we introduce a semi-supervised approach that exploits both monolingual and parallel data to train a Neural Machine Translation (NMT) system initialised by a cross-lingual language model trained with supervised and unsupervised modeling objectives. We assess the accuracy of sentiment translation by our proposed system through a numerical 'sentiment-closeness' measure as well as human evaluation. We show that our semi-supervised MT system can significantly help with correcting sentiment errors detected in the online translation of dialectal Arabic UGT.