Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have focused almost exclusively on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingual language models on Twitter. We provide: (1) a new strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages and an XLM-T model fine-tuned on them.