This paper describes a novel dataset consisting of sentences with semantic similarity annotations. The data originate from the journalistic domain in the Czech language. We describe the process of collecting and annotating the data in detail. The dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute each test annotation as the average of 9 individual annotations. We evaluate the quality of the dataset by measuring inter-annotator and intra-annotator agreement. Besides the agreement figures, we provide detailed statistics of the collected dataset. We conclude the paper with a baseline experiment in which we build a system for predicting the semantic similarity of sentences. Thanks to the large number of training annotations (116,956), the model performs significantly better than an average annotator (Pearson's correlation coefficient of 0.92 versus 0.86).
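The aggregation and agreement measures mentioned above can be illustrated with a minimal sketch. It assumes each sentence pair carries a fixed-length list of individual scores; the function names (`pearson`, `aggregate`, `annotator_vs_rest`), the toy scores, and the leave-one-out comparison are illustrative assumptions, not the procedure used in the paper.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def aggregate(annotations):
    """Average the individual scores of each sentence pair into one label."""
    return [mean(scores) for scores in annotations]


def annotator_vs_rest(annotations, k):
    """Correlate annotator k's scores with the mean of the other annotators.

    Assumption: this leave-one-out comparison is one common way to estimate
    how well a single annotator agrees with the rest; the paper may differ.
    """
    own = [scores[k] for scores in annotations]
    rest = [
        mean(s for i, s in enumerate(scores) if i != k)
        for scores in annotations
    ]
    return pearson(own, rest)


# Hypothetical toy data: 4 sentence pairs, 3 annotators each
# (the test set described above uses 9 annotations per pair).
toy = [[0, 1, 0], [3, 2, 3], [5, 6, 6], [2, 2, 1]]

gold = aggregate(toy)                    # averaged labels, one per pair
agreement = annotator_vs_rest(toy, 0)    # annotator 0 against the others
```

A model can be scored the same way by passing its predictions and the averaged labels to `pearson`, which is how a model's correlation could be compared against that of an individual annotator.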