This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model's performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
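To make the adjustment mechanism described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a transformation module that predicts a difference vector and a gating/scaling factor for a frozen CLIP-style text embedding. All names, layer sizes, and the embedding dimension of 512 are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class EmbeddingTransform(nn.Module):
    """Hypothetical sketch of the adjustment described in the abstract:
    given a frozen text embedding, predict a difference vector and a
    scaling factor that controls how strongly the adjustment is applied."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        # predicts the difference vector to add to the embedding
        self.delta = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        # predicts a scalar in [0, 1] that scales the difference vector,
        # so embeddings that need no adjustment are left (almost) unchanged
        self.gate = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid()
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        adjusted = text_emb + self.gate(text_emb) * self.delta(text_emb)
        # re-normalize, since CLIP similarities assume unit-length embeddings
        return adjusted / adjusted.norm(dim=-1, keepdim=True)


if __name__ == "__main__":
    # toy usage with a random stand-in for a batch of CLIP text embeddings
    transform = EmbeddingTransform()
    emb = torch.randn(4, 512)
    out = transform(emb)
    print(out.shape)  # torch.Size([4, 512])
```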