The degree of semantic relatedness (or closeness in meaning) of two units of language has long been considered fundamental to understanding meaning. Automatically determining relatedness has many applications, such as question answering and summarization. However, prior NLP work has largely focused on semantic similarity (a subset of relatedness) because of a lack of relatedness datasets. Here, for the first time, we introduce a dataset of semantic relatedness for sentence pairs. This dataset, STR-2021, has 5,500 English sentence pairs manually annotated for semantic relatedness using a comparative annotation framework. We show that the resulting scores have high reliability (repeat annotation correlation of 0.84). We use the dataset to explore several questions about what makes two sentences more semantically related. We also evaluate a suite of sentence representation methods on their ability to place pairs that are more related closer to each other in vector space.
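The evaluation described in the last sentence can be illustrated with a minimal sketch. It assumes that closeness in vector space is measured by cosine similarity and that agreement with the human relatedness scores is measured by Spearman rank correlation; the abstract states neither choice, so both are assumptions. The toy vectors and gold scores are invented for illustration and are not drawn from STR-2021.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sentence-pair embeddings (toy 2-d vectors, not real model output).
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly parallel vectors
    ([1.0, 0.0], [0.5, 0.5]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors
]
# Hypothetical gold relatedness scores for the same three pairs.
gold = [0.9, 0.5, 0.1]

predicted = [cosine(u, v) for u, v in pairs]
rho = spearman(predicted, gold)
print(f"Spearman correlation with gold scores: {rho:.3f}")
```

In this toy case the cosine scores order the three pairs exactly as the gold scores do, so the correlation is at its maximum. With a real sentence encoder, the vectors in `pairs` would come from the representation method under evaluation.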