Large Vision & Language (V&L) models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. This problem arises in multiple domains, from personalized image retrieval to personalized interaction with smart devices. We introduce a new learning setup called Personalized Vision & Language (PerVL) with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts "in the wild". In PerVL, personalized concepts should be learned (1) independently of the downstream task, (2) in a way that allows a pretrained model to reason about them in free language, and (3) without requiring personalized negative examples. We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the personalized concepts. The model can then reason about them by simply using them in a sentence. We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them to image retrieval and semantic segmentation using rich textual queries.
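To make the vocabulary-extension idea concrete, below is a minimal sketch of it, not the paper's actual implementation: a new word for the personalized concept is added to a frozen CLIP model (assuming the Hugging Face transformers CLIP backbone), its embedding table is grown by one row, and only that row is optimized to match a few positive images of the concept. The token name "<my-mug>", the cosine loss, the initialization, and the hyperparameters are all illustrative assumptions, and the dummy image stands in for the user's few examples.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Register a new word for the personalized concept with the tokenizer.
new_token = "<my-mug>"  # placeholder concept name, an assumption of this sketch
processor.tokenizer.add_tokens([new_token])

# Grow the text encoder's token-embedding table by one row for the new word.
old = model.text_model.embeddings.token_embedding
grown = torch.nn.Embedding(old.num_embeddings + 1, old.embedding_dim)
with torch.no_grad():
    grown.weight[:-1] = old.weight
    grown.weight[-1] = old.weight.mean(dim=0)  # neutral initialization (assumption)
model.text_model.embeddings.token_embedding = grown

# Freeze the pretrained model; only the new embedding row will be trained.
for p in model.parameters():
    p.requires_grad = False
grown.weight.requires_grad = True
optimizer = torch.optim.Adam([grown.weight], lr=1e-3)

# A few positive images of the concept (a dummy image here for runnability).
images = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))]

for _ in range(100):
    inputs = processor(text=[f"a photo of {new_token}"] * len(images),
                       images=images, return_tensors="pt", padding=True)
    out = model(**inputs)
    # Pull the embedding of a sentence containing the new word toward the
    # concept's image embeddings; no personalized negatives are needed.
    sim = torch.nn.functional.cosine_similarity(out.text_embeds, out.image_embeds)
    loss = (1.0 - sim).mean()
    optimizer.zero_grad()
    loss.backward()
    grown.weight.grad[:-1] = 0.0  # update only the new token's row
    optimizer.step()
```

After optimization, the new word behaves like any other word in the vocabulary, so a rich free-language query such as `f"{new_token} on a wooden table"` can be encoded with the same text encoder and used directly for downstream retrieval or segmentation.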