The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.
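The core recipe described above can be sketched as a linear probe over CLIP's shared embedding space in which each class name contributes one extra labeled "text sample" alongside the few-shot image samples. The snippet below is a minimal sketch under that reading; the prompt template, class names, dummy image tensors, and training hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of cross-modal few-shot adaptation over CLIP's shared
# embedding space: class-name text embeddings are repurposed as additional
# one-shot training samples, and a single linear classifier is trained on
# the pooled image + text features. Class names, the prompt template, and
# the dummy image tensors below are illustrative assumptions.
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat"]                      # hypothetical 2-way task
num_classes = len(class_names)

# Few-shot image samples (random tensors stand in for preprocessed images).
images = torch.randn(4, 3, 224, 224, device=device)
image_labels = torch.tensor([0, 0, 1, 1], device=device)

with torch.no_grad():
    # One text embedding per class, treated as an extra labeled sample.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = F.normalize(model.encode_text(tokens).float(), dim=-1)
    image_feats = F.normalize(model.encode_image(images).float(), dim=-1)

# Pool image and text samples into one cross-modal training set.
feats = torch.cat([image_feats, text_feats], dim=0)
labels = torch.cat([image_labels, torch.arange(num_classes, device=device)])

# An embarrassingly simple linear classifier over the shared feature space.
classifier = torch.nn.Linear(feats.shape[1], num_classes).to(device)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-4)

for step in range(100):
    loss = F.cross_entropy(classifier(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, classify image embeddings with the trained linear head.
with torch.no_grad():
    test_feats = F.normalize(model.encode_image(images).float(), dim=-1)
    preds = classifier(test_feats).argmax(dim=-1)
```

The same pooling step extends to further modalities (e.g. audio embeddings mapped into the same space), which is how the abstract's audiovisual benchmark uses cross-modal training to help both image and audio classification.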