What Can Human Sketches Do for Object Detection?

Source: arXiv. Accepted at Computer Vision and Pattern Recognition (CVPR), 2023, as one of the Top 12 Best Papers; it will be presented in the special single-track plenary sessions open to all attendees. Project Page: www.pinakinathc.me/sketch-detect

Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues. The exploration of such innate properties of human sketches has, however, been limited to that of image retrieval. In this paper, for the first time, we cultivate the expressiveness of sketches but for the fundamental vision task of object detection. The end result is a sketch-enabled object detection framework that detects based on what you sketch: that "zebra" (e.g., one that is eating the grass) in a herd of zebras (instance-aware detection), and only the part (e.g., "head" of a "zebra") that you desire (part-aware detection). We further dictate that our model works without (i) knowing which category to expect at testing (zero-shot) and (ii) requiring additional bounding boxes (as per fully supervised) and class labels (as per weakly supervised). Instead of devising a model from the ground up, we show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR), which can already elegantly solve the task: CLIP to provide model generalisation, and SBIR to bridge the (sketch→photo) gap. In particular, we first perform independent prompting on both sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders on the back of the generalisation ability of CLIP. We then devise a training paradigm to adapt the learned encoders for object detection, such that the region embeddings of detected boxes are aligned with the sketch and photo embeddings from SBIR. Evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object detectors (WSOD) on zero-shot setups. Project Page: https://pinakinathc.github.io/sketch-detect

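Summary: Sketches are highly expressive and naturally capture subjective, fine-grained visual cues. So far, however, these properties of human sketches have been exploited only for image retrieval. This paper is the first to bring the expressiveness of sketches to object detection, a fundamental vision task. The result is a sketch-based detection framework that picks out the particular "zebra" you drew (for example, the one eating grass) from a herd of zebras (instance-aware detection), or only the part you want, such as the zebra's "head" (part-aware detection). The model is also required to work without knowing at test time which category to expect (zero-shot), and without additional bounding boxes (as in fully supervised detection) or class labels (as in weakly supervised detection). Rather than designing a model from scratch, the authors show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR), which together can already solve the task elegantly: CLIP provides generalisation, and the SBIR model bridges the gap between sketches and photos. Specifically, they first prompt the sketch and photo branches of an SBIR model independently, building highly generalisable sketch and photo encoders on top of CLIP's generalisation ability. They then design a training paradigm that adapts the learned encoders to object detection, so that the region embeddings of detected boxes align with the sketch and photo embeddings from SBIR. On standard object detection datasets such as PASCAL-VOC and MS-COCO, the framework outperforms both supervised (SOD) and weakly-supervised (WSOD) object detectors in zero-shot setups. Project Page: https://pinakinathc.github.io/sketch-detect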
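
To make the idea of aligning region embeddings with sketch embeddings more concrete, here is a minimal sketch of how a sketch query might be matched against candidate boxes at test time. It is an illustration only, not the paper's implementation: the abstract does not state the scoring rule, so cosine similarity, the threshold value, the box format, and the names `cosine_similarity` and `rank_regions_by_sketch` are all assumptions, and the embeddings are toy 2-D vectors rather than outputs of real encoders.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_regions_by_sketch(
    sketch_embedding: Sequence[float],
    region_embeddings: Sequence[Sequence[float]],
    boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[Box, float]]:
    """Return (box, score) pairs whose region embedding matches the sketch, best first."""
    if len(region_embeddings) != len(boxes):
        raise ValueError("need exactly one embedding per box")
    scored = [
        (box, cosine_similarity(sketch_embedding, emb))
        for box, emb in zip(boxes, region_embeddings)
    ]
    kept = [(box, score) for box, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Toy data standing in for encoder outputs: three candidate boxes and one sketch query.
sketch = [1.0, 0.0]
regions = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 40.0, 40.0), (5.0, 5.0, 15.0, 15.0)]

hits = rank_regions_by_sketch(sketch, regions, boxes)
for box, score in hits:
    print(box, round(score, 3))
```

With these toy vectors, the second box is orthogonal to the sketch and is filtered out, while the other two are returned in order of similarity. Because the matching relies only on embedding similarity, nothing in this step needs a class label, which is the property the zero-shot setting depends on.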