This paper addresses the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. We then directly tackle these biases at both the frame and object-set levels. Concretely, our method addresses them by expanding the limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that incorporates query information while attending to the context of the full object-proposal set. We name this module the Conditioned Contextual Transformer, or CocoFormer. Our experiments show that the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. As a result, we improve frame-level detection performance from 26.28% to 31.26% AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second, respectively, in the VQ2D and VQ3D tasks of the 2nd Ego4D challenge. In addition, we demonstrate the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve state-of-the-art results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr.
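To make the idea of "conditioning on the query while modeling proposal-set context" concrete, below is a minimal PyTorch sketch. It is not the authors' CocoFormer implementation; the module and parameter names (QueryConditionedSetHead, d_model, etc.) are hypothetical, and the sketch only illustrates the general pattern of fusing a query embedding with each object-proposal feature and then applying self-attention over the whole proposal set before scoring each proposal.

```python
# Minimal sketch (not the released implementation) of a query-conditioned
# set-context head: each proposal is scored for the visual query while
# attending to every other proposal in the frame. All names are hypothetical.
import torch
import torch.nn as nn

class QueryConditionedSetHead(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # Fuse each proposal feature with the query embedding before set attention.
        self.fuse = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # Self-attention over the proposal set provides frame-level context.
        self.set_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(d_model, 1)  # per-proposal match logit

    def forward(self, proposals: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, d_model) features of N object proposals per frame
        # query:     (B, d_model)    embedding of the visual query crop
        q = query.unsqueeze(1).expand(-1, proposals.size(1), -1)
        fused = self.fuse(torch.cat([proposals, q], dim=-1))
        context = self.set_encoder(fused)        # proposals attend to each other
        return self.score(context).squeeze(-1)   # (B, N) relevance logits

if __name__ == "__main__":
    head = QueryConditionedSetHead()
    logits = head(torch.randn(2, 100, 256), torch.randn(2, 256))
    print(logits.shape)  # torch.Size([2, 100])
```

Under this reading, the proposal-dropping strategy mentioned above would simply randomize which rows of the (B, N, d_model) proposal tensor are kept during training, discouraging the head from relying on spurious set-level shortcuts.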