Large-scale commonsense knowledge bases empower a broad range of AI applications, making the automatic extraction of commonsense knowledge (CKE) a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on individual image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show strong correlation with human judgment, with a Spearman coefficient of 0.78. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. The data and code can be obtained at https://github.com/thunlp/CLEVER.
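To make the multi-instance formulation concrete, below is a minimal sketch of relation-aware attention over a bag of per-image features, in the spirit of selective-attention multi-instance learning. It assumes each image in the bag has already been encoded into a fixed-size feature vector (e.g., by a vision-language pre-training model); the class name `BagAttentionAggregator` and all hyperparameters are illustrative assumptions, and the sketch does not reproduce the contrastive component of CLEVER's attention mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagAttentionAggregator(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): aggregate
    per-image features for one (head, tail) entity pair into bag-level
    representations via relation-aware attention, then score each
    candidate commonsense relation."""

    def __init__(self, feat_dim: int, num_relations: int):
        super().__init__()
        # One learnable query per relation, so different relations can
        # attend to different informative images in the same bag.
        self.relation_queries = nn.Embedding(num_relations, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_relations)

    def forward(self, bag_feats: torch.Tensor) -> torch.Tensor:
        # bag_feats: (bag_size, feat_dim) features of all images in the
        # bag, assumed to come from a vision-language encoder.
        scores = self.relation_queries.weight @ bag_feats.T   # (R, bag_size)
        alpha = F.softmax(scores, dim=-1)                     # attention over instances
        bag_repr = alpha @ bag_feats                          # (R, feat_dim)
        # Relation r is scored from the bag representation attended with
        # its own query, hence the diagonal of the (R, R) logit matrix.
        return torch.diagonal(self.classifier(bag_repr))     # (R,)

# Usage sketch: a bag of 8 images for (person, bottle) with 512-d features
# and 24 candidate relations; values are arbitrary placeholders.
bag = torch.randn(8, 512)
model = BagAttentionAggregator(feat_dim=512, num_relations=24)
relation_logits = model(bag)  # one commonsense score per candidate relation
```

Under distant supervision, such a model would be trained with bag-level relation labels only, so the attention weights are what let it downweight noisy images without instance-level annotation.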