Estimating semantic similarity is crucial for a variety of natural language processing (NLP) tasks. In the absence of a general theory of semantic information, many papers rely on human annotators as the source of ground truth for semantic similarity estimation. This paper investigates the ambiguities inherent in crowd-sourced semantic labeling. It shows that annotators who treat semantic similarity as a binary category (two sentences are either similar or not similar, with no middle ground) play the most important role in the labeling. The paper offers heuristics for filtering out unreliable annotators and aims to stimulate further discussion of human perception of semantic similarity.