An important challenge in news fact-checking is the effective dissemination of existing fact-checks, which in turn requires reliable methods for detecting previously fact-checked claims. In this paper, we focus on automatically finding existing fact-checks for claims made in social media posts (tweets). We conduct both classification and retrieval experiments in monolingual (English only), multilingual (Spanish, Portuguese), and cross-lingual (Hindi-English) settings, using multilingual transformer models such as XLM-RoBERTa and multilingual embeddings such as LaBSE and SBERT. We report promising results for "match" classification (86% average accuracy) across four language pairs. We also find that a BM25 baseline outperforms or is on par with state-of-the-art multilingual embedding models on the retrieval task in our monolingual experiments. We highlight and discuss the NLP challenges of addressing this problem in different languages, and we introduce a novel curated dataset of fact-checks and corresponding tweets for future research.