自动递减实体检索 (Autoregressive Entity Retrieval)

from arxiv, Accepted (spotlight) at International Conference on Learning Representations (ICLR) 2021. Code at https://github.com/facebookresearch/GENRE. 20 pages, 9 figures, 8 tables

Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. Current approaches can be understood as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach has several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion. This mitigates the aforementioned technical issues since: (i) the autoregressive formulation directly captures relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the softmax loss is computed without subsampling negative data. We experiment with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their names. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

翻译：实体位于我们如何代表和共享知识的中心。例如, 诸如 Wikipedia 等百科全书( 维基百科) 等实体是由实体构建的( 例如, 每一份维基百科文章 ) 。检索这些实体的查询能力对于诸如实体链接和开放式答题等知识密集型任务至关重要。目前的方法可以被理解为原子标签中的分类, 每个实体都有。它们的重量矢量是用编码实体的描述等元信息生成的密集的实体表示。这种方法有几个缺点:( 一) 内容和实体的亲近性主要通过矢量点产品( 可能缺少细微的交互作用 ) 。 (二) 需要大缩略微的记忆足迹在考虑大型实体设置时存储密集的表达形式;(三) 适当硬的一组负数据必须在培训时间进行分解。在这项工作中, 我们建议GENRE, 第一个系统通过生成其独有的名称, 向右增缩, 逐个键, 以自动递增的方式, 这可以缓解上述技术问题, 因为:(一) 直接进行自我递增的模型的配置, 直接定义的配置, 直接连接的缩缩缩缩的图像的配置, 和缩缩缩缩缩缩的缩的缩缩的缩的缩的缩缩缩的缩的缩缩缩的缩的缩缩缩缩缩略图,, 是因为我们的缩缩缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩略体名称和缩的缩略的缩略的缩的缩的缩略的缩的缩的缩的缩的缩的缩略的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩的缩略图。