While contextualized word embeddings have become the de facto standard, learning contextualized phrase embeddings remains under-explored and is hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC -- a dataset of ~28K noun phrases accompanied by their contextual Wikipedia pages and a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking models' accuracy and remarkably pushes span-selection (SS) models (i.e., models that predict the start and end indices of the target phrase) to near-human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that this impressive performance arises because the SS models learn to capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly in distinguishing two senses of the same phrase in two different contexts (~60% EM) and in estimating the similarity between two different phrases in the same context (~70% EM).
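To make the SS task and the EM metric concrete, below is a minimal sketch (not the authors' code) of span-selection inference for semantic search: given a query phrase and a passage, the model predicts start/end token indices, and the decoded span is scored by Exact Match against the gold phrase. The model checkpoint, example strings, and the `exact_match` helper are illustrative placeholders, not part of PiC.

```python
# Minimal sketch of span-selection (SS) semantic search, assuming a
# Hugging Face extractive-QA-style model. The checkpoint below is a
# hypothetical stand-in, not the SotA model evaluated in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL = "deepset/roberta-base-squad2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)

query = "massive figure"  # illustrative query phrase
passage = "He peered out at the massive figure looming in the fog."

inputs = tokenizer(query, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Greedy decoding: argmax over start and end logits gives the predicted span.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
pred_span = tokenizer.decode(inputs["input_ids"][0][start : end + 1]).strip()


def exact_match(pred: str, gold: str) -> bool:
    # Exact Match (EM): whitespace-normalized, case-insensitive equality.
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)


print(pred_span, exact_match(pred_span, "massive figure"))
```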