CsFEVER and CTKFacts: Czech Datasets for Fact Verification

In this paper we present two Czech datasets intended for training machine learning models for automated fact-checking. Specifically, we address the task of assessing the veracity of a textual claim with respect to a (presumably) verified corpus. The system outputs the claim classification SUPPORTS or REFUTES, complemented with evidence documents, or NEI (Not Enough Info) alone. First, we publish CsFEVER, a dataset of approximately 112k claims that is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset. We took a hybrid approach combining machine translation and language alignment; the same method (and the tools we provide) can easily be applied to other languages. The second dataset, CTKFacts, contains 3,097 claims and is built on a corpus of approximately two million Czech News Agency news reports. We present an extended methodology based on the FEVER approach. Most notably, we describe a method to automatically generate wider claim contexts (dictionaries) for non-hyperlinked corpora. The datasets are analyzed for spurious cues, which are annotation patterns that lead to model overfitting. CTKFacts is further examined for inter-annotator agreement, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline.
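The output format described in the abstract (a SUPPORTS or REFUTES label accompanied by evidence documents, or NEI on its own) can be illustrated with a small record type. This is only a sketch of that constraint, not the authors' data schema: the names `Label`, `Verdict`, and the evidence identifiers such as `"doc-1"` are hypothetical and chosen here for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    """The three claim classes named in the abstract."""
    SUPPORTS = "SUPPORTS"
    REFUTES = "REFUTES"
    NEI = "NOT ENOUGH INFO"


@dataclass
class Verdict:
    """One fact-checking output (hypothetical structure, for illustration)."""
    claim: str
    label: Label
    # Identifiers of evidence documents; the id format is an assumption.
    evidence: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # NEI is returned alone, without evidence documents.
        if self.label is Label.NEI and self.evidence:
            raise ValueError("NEI verdicts carry no evidence")
        # SUPPORTS and REFUTES are complemented with evidence documents.
        if self.label is not Label.NEI and not self.evidence:
            raise ValueError(
                f"{self.label.value} verdicts need at least one evidence document"
            )


if __name__ == "__main__":
    supported = Verdict("Example claim.", Label.SUPPORTS, ["doc-1"])
    unknown = Verdict("Another example claim.", Label.NEI)
    print(supported.label.value, supported.evidence)
    print(unknown.label.value, unknown.evidence)
```

Validating in `__post_init__` keeps the pairing of label and evidence in one place, so downstream pipeline stages can rely on it without re-checking.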

