Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we present the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims to improve the learning of structured visual relationships that come from the long tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and provide a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existing models, and despite their simplicity, our results show that they can markedly improve performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.
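To make the "easily integrated on top of existing models" claim concrete, below is a minimal, hypothetical sketch of a Mixup-style augmentation applied jointly to the subject, relation, and object classification targets, in the spirit of RelMix. The function name `mixup_relationship_batch`, the argument names, and the default `alpha` are illustrative assumptions, not the exact implementation in the linked repository.

```python
import numpy as np
import torch
import torch.nn.functional as F


def mixup_relationship_batch(feats, subj_labels, rel_labels, obj_labels,
                             num_subj, num_rel, num_obj, alpha=0.2):
    """Hypothetical Mixup-style augmentation for (subject, relation, object)
    recognition: mixes visual features and the one-hot targets of all three
    classification heads with the same coefficient lambda."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(feats.size(0))

    # Convex combination of each example with a randomly paired example.
    mixed_feats = lam * feats + (1.0 - lam) * feats[perm]

    def mix_onehot(labels, num_classes):
        onehot = F.one_hot(labels, num_classes).float()
        return lam * onehot + (1.0 - lam) * onehot[perm]

    return (mixed_feats,
            mix_onehot(subj_labels, num_subj),
            mix_onehot(rel_labels, num_rel),
            mix_onehot(obj_labels, num_obj))
```

The mixed soft targets can then be trained with a soft-label cross-entropy on each of the three heads, so the augmentation drops into an existing relationship-recognition pipeline without changing the model architecture.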