This paper describes a transformer-based system designed for SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis. The goal of the task was to predict the intimacy of tweets on a scale from 1 (not intimate at all) to 5 (very intimate). The official training set consisted of tweets in six languages (English, Spanish, Italian, Portuguese, French, and Chinese). The test set included these six languages as well as tweets in four languages not present in the training set (Hindi, Arabic, Dutch, and Korean). We present a solution based on an ensemble of XLM-T, a multilingual RoBERTa model adapted to the Twitter domain. To improve performance on unseen languages, each tweet was supplemented with its English translation. We explored how effective translated data is for languages seen during fine-tuning compared to unseen languages, and evaluated strategies for using translated data in transformer-based models. Our solution ranked 4th on the leaderboard, achieving an overall Pearson's r of 0.599 on the test set. The proposed system improves on the score averaged across all 45 submissions by up to 0.088 Pearson's r.
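To make the described setup concrete, the following is a minimal sketch, not the authors' exact pipeline: it pairs a tweet with its English translation as a two-segment input to XLM-T with a single-output regression head. The checkpoint name is the publicly released XLM-T model on the HuggingFace Hub; the example tweet, its translation, and the sentence-pair input format are illustrative assumptions, and the paper's actual ensembling and hyperparameters are not shown here.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Public XLM-T checkpoint (multilingual RoBERTa adapted to Twitter).
MODEL = "cardiffnlp/twitter-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels=1 gives a single-output head, used here as a regression
# head for the 1-5 intimacy score (the head is freshly initialized and
# would still need fine-tuning on the task data).
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

tweet = "Te extraño mucho"          # hypothetical example tweet
translation = "I miss you so much"  # its English translation

# Feed the tweet and its English translation as a sentence pair, so the
# model sees the original text alongside the translated text.
inputs = tokenizer(tweet, translation, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(f"predicted intimacy score: {score:.3f}")  # meaningful only after fine-tuning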