Retrieval-augmented models are becoming increasingly popular for computer vision tasks, following their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving, from an external memory set, examples similar to the visual input. In this work, we introduce an attention-based memory module, which learns the importance of each example retrieved from the memory. Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and compare the performance of different memory representations. We evaluate our method on three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and WebVision datasets.
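
To make the idea concrete, below is a minimal sketch of attention-weighted aggregation over retrieved memory examples, assuming the query image and the memory entries have already been embedded by a shared encoder. The function name `attend_to_memory` and the key/value split are illustrative assumptions, not the paper's actual module or API.

```python
import numpy as np

def attend_to_memory(query: np.ndarray, mem_keys: np.ndarray, mem_values: np.ndarray) -> np.ndarray:
    """Weight each retrieved example by its relevance to the query.

    query:      (d,)   embedding of the input image
    mem_keys:   (k, d) embeddings of the k retrieved memory examples
    mem_values: (k, v) value vectors (e.g. label or text embeddings)
    returns:    (v,)   attention-weighted memory summary
    """
    d = query.shape[-1]
    scores = mem_keys @ query / np.sqrt(d)   # (k,) similarity logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: irrelevant examples receive small weight
    return weights @ mem_values              # refined representation passed on to the classifier

# Toy usage: 4 retrieved examples with 8-dim keys and 3-dim values.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
out = attend_to_memory(q, rng.normal(size=(4, 8)), rng.normal(size=(4, 3)))
print(out.shape)  # (3,)
```

The key design choice this sketch reflects is that the softmax weighting lets the model effectively ignore retrieved neighbors that are unrelated to the query, rather than averaging them in uniformly.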