This paper presents the Multilingual COVID-19 Analysis Method (CMTA) for detecting and observing the spread of misinformation about the disease in texts. CMTA proposes a data science (DS) pipeline that applies machine learning models to process, classify (Dense-CNN), and analyze (MBERT) multilingual (micro-)texts. The pipeline's data preparation tasks extract features from multilingual textual data and categorize the data into specific information classes (i.e., 'false', 'partly false', 'misleading'). The CMTA pipeline was tested on multilingual micro-texts (tweets), showing how misinformation spreads across different languages. To assess CMTA's performance and put it in perspective, we compared it with eight monolingual models used for detecting misinformation. The comparison shows that CMTA outperformed various monolingual models and suggests that it can serve as a general method for detecting misinformation in multilingual micro-texts. The experimental results show misinformation trends about COVID-19 in different languages during the first months of the pandemic.