Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a framework for using and evaluating multilingual language models on Twitter. This framework features two main assets: (1) a strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune it on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages. The framework is modular and can easily be extended to additional tasks, as well as integrated with recent efforts also aimed at the homogenization of Twitter-specific datasets (Barbieri et al. 2020).
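As a minimal sketch of how such a pre-trained Twitter model could be used for multilingual sentiment analysis, the snippet below loads a checkpoint through the Hugging Face `transformers` library and scores a code-mixed tweet. The model identifier `cardiffnlp/twitter-xlm-roberta-base-sentiment` and the label order (negative, neutral, positive) are assumptions, not details stated in this abstract; this is illustrative usage rather than the authors' official starter code.

```python
# Hedged sketch: assumes the XLM-T sentiment checkpoint is published on the
# Hugging Face Hub as "cardiffnlp/twitter-xlm-roberta-base-sentiment".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

# A code-mixed (Spanish/English) tweet, the kind of noisy input the
# Twitter-specific pre-training is meant to handle.
text = "Huggingface es lo mejor! Awesome library 🤗"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label order for this checkpoint: 0=negative, 1=neutral, 2=positive.
probs = torch.softmax(logits, dim=-1).squeeze()
labels = ["negative", "neutral", "positive"]
for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

For fine-tuning on one of the eight-language sentiment datasets, the same `AutoModelForSequenceClassification` head could be trained with a standard `Trainer` loop; the base (non-sentiment) checkpoint would be the natural starting point in that case.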