Recent state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This free-form supervision gives the learned visual models high generality and usability, but it relies on extensive data-collection heuristics to cover as many visual concepts as possible. Alternatively, learning with external knowledge about images is a promising approach that leverages a much more structured source of supervision. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy that leverages external knowledge to build transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that can understand both visual concepts and their knowledge; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones), enabling zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer-learning performance over existing methods.
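To make the knowledge-augmentation step concrete, the sketch below shows one plausible way to enrich a class name with an external definition before it is used as a text prompt. This is an illustrative assumption, not the paper's implementation: the tiny `KNOWLEDGE` dictionary stands in for real WordNet/Wiktionary lookups, and `augment_prompt` is a hypothetical helper name.

```python
# Minimal sketch of knowledge-augmented prompting in the spirit of K-LITE.
# The KNOWLEDGE dict is a stand-in for WordNet/Wiktionary queries (assumption);
# in practice the definitions would be retrieved from those resources.
KNOWLEDGE = {
    "tench": "a freshwater fish of the carp family",
    "abacus": "a calculating tool with beads that slide on rods",
}

def augment_prompt(class_name: str, template: str = "a photo of a {}.") -> str:
    """Build a text prompt, appending external knowledge when available."""
    prompt = template.format(class_name)
    definition = KNOWLEDGE.get(class_name)
    if definition:
        # Append the retrieved definition so the text encoder sees both the
        # concept name and its knowledge.
        prompt += f" {class_name} is {definition}."
    return prompt

print(augment_prompt("tench"))
print(augment_prompt("goldfish"))  # no knowledge entry -> plain prompt
```

The augmented prompt is then fed to the text encoder in place of the plain category name, which is how both training and zero-shot evaluation can benefit from the same enrichment.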