This paper demonstrates a two-stage method for deriving insights from social media data relating to disinformation by applying a combination of geospatial classification and embedding-based language modelling across multiple languages. In particular, the analysis is centered on Twitter and disinformation for three European languages: English, French and Spanish. Firstly, Twitter data is classified into European and non-European sets using BERT. Secondly, Word2vec is applied to the classified texts, resulting in Eurocentric, non-Eurocentric and global representations of the data for the three target languages. This comparative analysis not only demonstrates the efficacy of the classification method but also highlights geographic, temporal and linguistic differences in the disinformation-related media. Thus, the contributions of the work are threefold: (i) a novel language-independent transformer-based geolocation method; (ii) an analytical approach that exploits lexical specificity and word embeddings to interrogate user-generated content; and (iii) a dataset of 36 million disinformation-related tweets in English, French and Spanish.