Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims that are first annotated for containing "claim-like statements" and then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages in our dataset. We provide evaluations on the performance of our solution and compare with baselines and existing state-of-the-art multilingual embedding models, namely LASER and LaBSE. We demonstrate that our performance exceeds LASER and LaBSE in all settings. We release our annotated datasets, codebooks, and trained embedding model to allow for further research.
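To make the knowledge-distillation setup concrete, below is a minimal sketch (not the authors' released code) of the standard teacher–student distillation recipe for multilingual sentence embeddings: a frozen high-quality "teacher" embeds the high-resource side of a parallel pair, and the multilingual "student" is trained so that its embeddings of both sides of the pair match the teacher's. The names `teacher_encode`, `student`, and the parallel sentence lists are hypothetical placeholders, not artifacts from the paper.

```python
import torch
import torch.nn as nn


def distillation_step(student: nn.Module,
                      teacher_encode,        # callable: list[str] -> Tensor (frozen teacher)
                      src_sentences: list,   # high-resource side (e.g., English)
                      tgt_sentences: list,   # low-resource translations (e.g., Tamil)
                      optimizer: torch.optim.Optimizer) -> float:
    """One training step: pull student(src) and student(tgt) toward teacher(src)."""
    with torch.no_grad():
        target = teacher_encode(src_sentences)      # teacher embeddings, no gradients

    student_src = student(src_sentences)            # student embedding of the source
    student_tgt = student(tgt_sentences)            # student embedding of the translation

    # MSE distillation loss on both sides of the parallel pair, so translations of
    # the same claim land near each other and near the teacher's embedding.
    loss = nn.functional.mse_loss(student_src, target) + \
           nn.functional.mse_loss(student_tgt, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, claim matching under this setup would amount to embedding an incoming tipline message with the trained student and retrieving previously fact-checked claims by cosine similarity, serving the existing fact-check when the similarity exceeds a chosen threshold.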