The appearance of complex attention-based language models such as BERT, RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case of social networks such as Twitter, an ever-changing stream of information written in informal, complex language, where each message requires careful evaluation to be understood even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges: when powerful state-of-the-art multilingual language models are applied to this scenario, language-specific nuances tend to get lost in translation. Moreover, misinformation spreads wildly on platforms such as Twitter in languages other than English, meaning the performance of transformers may suffer when transferred outside English-speaking communities. To face these challenges we present \textbf{BERTuit}, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource to better understand Spanish Twitter and to support applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, three very competitive multilingual transformers. The utility of our approach is demonstrated with two applications: a zero-shot methodology to visualize groups of hoaxes, and the profiling of authors who spread disinformation.
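As an illustration of the kind of zero-shot pipeline the abstract alludes to, the sketch below embeds Spanish tweets with a pre-trained transformer and projects the embeddings to 2-D so that groups of related hoaxes become visible. It is a minimal sketch, not the paper's implementation: the checkpoint identifier \texttt{your-org/bertuit} is hypothetical (substitute the released model), mean pooling over token states is one common sentence-embedding choice, and t-SNE stands in for whichever projection the authors actually use.
\begin{verbatim}
# Minimal zero-shot sketch: embed tweets, project to 2-D for visualization.
# Checkpoint name below is HYPOTHETICAL; replace with the published model.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE

MODEL_ID = "your-org/bertuit"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

tweets = [
    "Bulo: beber agua caliente cura la gripe",
    "ATENCION: comparte antes de que lo borren",
    "Informe oficial sobre la campana de vacunacion",
    "Otro tuit con desinformacion sanitaria",
]

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True,
                      return_tensors="pt")
    hidden = model(**batch).last_hidden_state         # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)      # zero out padding
    embeddings = (hidden * mask).sum(1) / mask.sum(1) # mean pooling

# Nearby points in the 2-D projection suggest groups of related hoaxes.
coords = TSNE(n_components=2,
              perplexity=min(30, len(tweets) - 1)).fit_transform(
    embeddings.numpy())
\end{verbatim}
No fine-tuning is involved: clustering structure in the projected space comes entirely from the pre-trained representations, which is what makes the methodology zero-shot.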