Many general-purpose benchmark datasets exist for Semantic Textual Similarity, but none focuses on the technical concepts found in patents and scientific publications. This work aims to fill that gap by presenting a new human-rated contextual phrase-to-phrase matching dataset. The full dataset contains close to 50,000 rated phrase pairs, each with a CPC (Cooperative Patent Classification) class as context. This paper describes the dataset and some baseline models.