Visual Entity Linking (VEL) is the task of linking regions of images to their corresponding entities in Knowledge Bases (KBs), which benefits many computer vision tasks such as image retrieval, image captioning, and visual question answering. However, existing VEL tasks either rely on textual data to complement multi-modal linking or only link objects to general entities, and thus fail to perform named entity linking on large amounts of image data. In this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task, where the input consists only of an image. The task is to identify objects of interest (i.e., visual entity mentions) in images and link them to their corresponding named entities in KBs. Since each entity in a KB often contains rich visual and textual information, we propose three different sub-tasks: visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL). In addition, we present a high-quality human-annotated visual person linking dataset named WIKIPerson. Based on WIKIPerson, we establish a series of baseline algorithms for each sub-task and conduct experiments to verify the quality of the proposed dataset and the effectiveness of the baseline methods. We envision this work to be helpful in soliciting more research on VNEL in the future. The code and dataset are publicly available at https://github.com/ict-bigdatalab/VNEL.
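To make the three sub-tasks concrete, the following is a minimal sketch of how each can be framed as similarity search with an off-the-shelf bi-encoder such as CLIP. This is an illustrative assumption, not the paper's actual baseline implementation; the model name, helper functions, and the fusion weight `alpha` are all hypothetical choices.

```python
# Minimal sketch of V2VEL / V2TEL / V2VTEL scoring with a CLIP bi-encoder.
# Assumptions (not from the paper): the checkpoint name and the linear
# fusion with weight `alpha` are illustrative only.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(img: Image.Image) -> torch.Tensor:
    # Encode an image (the visual mention or the entity's KB image).
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_text(text: str) -> torch.Tensor:
    # Encode the entity's textual description from the KB.
    inputs = processor(text=[text], return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def v2vel_score(mention_img, entity_img) -> float:
    # V2VEL: match the visual mention against the entity's KB image.
    return (embed_image(mention_img) @ embed_image(entity_img).T).item()

def v2tel_score(mention_img, entity_desc: str) -> float:
    # V2TEL: match the visual mention against the entity's description.
    return (embed_image(mention_img) @ embed_text(entity_desc).T).item()

def v2vtel_score(mention_img, entity_img, entity_desc: str,
                 alpha: float = 0.5) -> float:
    # V2VTEL: fuse both modalities; `alpha` is a hypothetical weight.
    return (alpha * v2vel_score(mention_img, entity_img)
            + (1 - alpha) * v2tel_score(mention_img, entity_desc))
```

In this framing, linking reduces to ranking all candidate KB entities by the relevant score and returning the top-scoring one for each detected visual mention.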