Large-scale commonsense knowledge bases empower a broad range of AI applications, making the automatic extraction of commonsense knowledge (CKE) a fundamental and challenging problem. CKE from text is known to suffer from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can therefore serve as a promising source of grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for a deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge of promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show strong correlation with human judgment, with a Spearman coefficient of 0.78. Moreover, the extracted commonsense can be grounded into images with reasonable interpretability. The data and code are available at https://github.com/thunlp/CLEVER.
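To make the multi-instance formulation concrete, the sketch below shows attention-based aggregation over a bag of image features for one entity pair. This is a minimal illustration, not the authors' implementation: it assumes per-image features have already been extracted by a vision-language encoder, the class name `BagAttentionClassifier` and all dimensions are hypothetical, and it uses plain attention pooling where the paper proposes a contrastive attention mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagAttentionClassifier(nn.Module):
    """Illustrative multi-instance classifier: weights each image in a bag
    by informativeness, pools the bag, and predicts a commonsense relation.
    (Hypothetical sketch; CLEVER's contrastive attention refines this.)"""

    def __init__(self, feat_dim: int, num_relations: int):
        super().__init__()
        self.query = nn.Linear(feat_dim, 1)            # per-instance informativeness score
        self.classifier = nn.Linear(feat_dim, num_relations)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        # bag: (num_images, feat_dim) -- features of images that all contain
        # the same entity pair, e.g. (person, bottle)
        weights = F.softmax(self.query(bag), dim=0)    # (num_images, 1), sums to 1 over the bag
        summary = (weights * bag).sum(dim=0)           # attention-pooled bag representation
        return self.classifier(summary)                # relation logits, e.g. can_hold

# Usage: a bag of 8 images for one entity pair, 512-d features, 10 candidate relations
model = BagAttentionClassifier(feat_dim=512, num_relations=10)
logits = model(torch.randn(8, 512))
```

The key property of this setup is that supervision is only needed at the bag level (the entity pair's relation from distant supervision), never for individual images; the attention weights indicate which instances ground the predicted relation.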