This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Focusing on the visual and linguistic modalities, we categorize data knowledge into five unit types, namely in-image, in-text, cross-image, cross-text, and image-text, and set up an efficient pipeline that helps construct a multimodal knowledge graph from any data collection. Thanks to the logical information naturally contained in a knowledge graph, organizing datasets in the UKnow format opens up more possibilities for data usage than the commonly used image-text pairs. Following the UKnow protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (571,791 of them vision-related) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 11 coarse labels and 9,185 fine labels. Experiments on four benchmarks demonstrate the potential of UKnow in supporting common-sense reasoning and boosting vision-language pre-training with a single dataset, benefiting from its unified form of knowledge organization. Code, dataset, and models will be made publicly available.
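To make the five unit types and the triplet-based organization more concrete, the following is a minimal sketch of how such a graph might be represented in memory. It is not the UKnow data format or the authors' pipeline: the class names (`KnowledgeType`, `Node`, `Triplet`), the `modality` field, the relation strings, the node identifiers, and the `summarize` helper are all assumptions introduced here for illustration. Only the five type names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class KnowledgeType(Enum):
    """The five unit types named in the abstract."""
    IN_IMAGE = "in-image"
    IN_TEXT = "in-text"
    CROSS_IMAGE = "cross-image"
    CROSS_TEXT = "cross-text"
    IMAGE_TEXT = "image-text"


@dataclass(frozen=True)
class Node:
    node_id: str
    modality: str  # hypothetical field: "image" or "text"


@dataclass(frozen=True)
class Triplet:
    head: Node
    relation: str  # hypothetical relation label
    tail: Node
    knowledge_type: KnowledgeType


def summarize(triplets):
    """Count nodes, vision-related nodes, triplets, and triplets per type."""
    nodes = {n for t in triplets for n in (t.head, t.tail)}
    return {
        "nodes": len(nodes),
        "vision_nodes": sum(n.modality == "image" for n in nodes),
        "triplets": len(triplets),
        "by_type": Counter(t.knowledge_type.value for t in triplets),
    }


# Toy graph with made-up identifiers and relations.
img1 = Node("img:1", "image")
img2 = Node("img:2", "image")
region = Node("img:1#region0", "image")
caption = Node("txt:1", "text")
entity = Node("txt:1#entity0", "text")

triplets = [
    Triplet(img1, "contains", region, KnowledgeType.IN_IMAGE),
    Triplet(caption, "mentions", entity, KnowledgeType.IN_TEXT),
    Triplet(img1, "same_event", img2, KnowledgeType.CROSS_IMAGE),
    Triplet(img1, "described_by", caption, KnowledgeType.IMAGE_TEXT),
]

stats = summarize(triplets)
print(stats)
```

Tagging each triplet with its unit type lets one graph serve several uses: filtering to image-text edges recovers ordinary image-text pairs, while the remaining edges keep the relational structure that pairs alone discard.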