Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content. Two distinct challenges remain, however: high sensitivity to the choice of handcrafted class names that define the queries, and the difficulty of adapting to new, smaller datasets. To address these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names. We show that our solution can easily be integrated into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labelling errors.
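As a minimal sketch of the core idea, the snippet below (assuming OpenAI's `clip` package, a fixed prompt template, and a placeholder-token position, none of which are specified in the abstract) replaces each handcrafted class-name token with a learnable embedding and optimises only those embeddings while the rest of the model stays frozen. Initialisation here is random purely to keep the sketch self-contained; the abstract does not prescribe an initialisation scheme.

```python
import torch
import clip  # assumption: OpenAI CLIP package (github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
for p in model.parameters():  # freeze the entire vision-language model
    p.requires_grad_(False)

num_classes = 10  # hypothetical dataset size
embed_dim = model.token_embedding.embedding_dim

# One learnable word embedding per class; these are the only trained parameters.
class_embeddings = torch.nn.Parameter(
    torch.randn(num_classes, embed_dim, device=device) * 0.02
)

# Tokenise a fixed template; "x" is a placeholder token whose embedding
# we overwrite with the learnable per-class embedding.
template = clip.tokenize(["a photo of a x"] * num_classes).to(device)
placeholder_pos = 5  # token position of "x" in this template (assumption)

def text_features():
    tok = model.token_embedding(template).type(model.dtype)
    tok[torch.arange(num_classes), placeholder_pos] = class_embeddings.type(model.dtype)
    x = tok + model.positional_embedding.type(model.dtype)
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x).type(model.dtype)
    # read out features at the EOT token, as in CLIP's encode_text
    eot = template.argmax(dim=-1)
    feats = x[torch.arange(num_classes), eot] @ model.text_projection
    return feats / feats.norm(dim=-1, keepdim=True)

optimizer = torch.optim.Adam([class_embeddings], lr=1e-3)

def training_step(images, labels):
    # Standard contrastive-style classification: image features against
    # the text features built from the learned class embeddings.
    img = model.encode_image(images)
    img = img / img.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * img @ text_features().t()
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only `class_embeddings` receives gradients, the frozen encoders are untouched: classes whose names are left as plain text keep their zero-shot behaviour, which is what the abstract means by retaining zero-shot capabilities for new classes.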