In this work, we investigate the problem of sketch-based object localization in natural images, where, given a crude hand-drawn sketch of an object, the goal is to localize all instances of that object in the target image. This problem is difficult due to the abstract nature of hand-drawn sketches, variations in sketch style and quality, and the large domain gap between sketches and natural images. To mitigate these challenges, existing works have proposed attention-based frameworks that incorporate query information into the image features. However, in these works, the query features are incorporated only after the image features have been learned independently, leading to inadequate alignment. In contrast, we propose a sketch-guided vision transformer encoder that applies cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features, leading to stronger alignment with the query sketch. Further, at the output of the decoder, the object and sketch features are refined to bring the representations of relevant objects closer to the sketch query, thereby improving localization. The proposed model also generalizes to object categories not seen during training, as the target image features learned by our method are query-aware. Our localization framework can also utilize multiple sketch queries via a novel trainable sketch fusion strategy. The model is evaluated on images from the public object detection benchmark MS-COCO, using sketch queries from the QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach yields $6.6\%$ and $8.0\%$ improvements in mAP for seen objects using sketch queries from the QuickDraw! and Sketchy datasets, respectively, and a $12.2\%$ improvement in AP@50 for large objects that are `unseen' during training.
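The core idea of query-conditioned encoding can be illustrated with a minimal sketch: after each image-encoder block, image tokens attend to sketch tokens via cross-attention with a residual connection. The function and weight names below are hypothetical and use single-head NumPy attention for clarity; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sketch_cross_attention(img_tokens, sketch_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: image tokens (queries) attend to
    sketch tokens (keys/values), with a residual connection.
    img_tokens: (N_img, d), sketch_tokens: (N_sketch, d)."""
    Q = img_tokens @ Wq        # queries come from the image
    K = sketch_tokens @ Wk     # keys come from the sketch query
    V = sketch_tokens @ Wv     # values come from the sketch query
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return img_tokens + attn @ V   # query-conditioned image tokens

# Interleaving with the encoder (pseudocode for each block i):
#   img_tokens = encoder_block_i(img_tokens)
#   img_tokens = sketch_cross_attention(img_tokens, sketch_tokens, Wq, Wk, Wv)

rng = np.random.default_rng(0)
d, n_img, n_sk = 16, 49, 8
img = rng.standard_normal((n_img, d))
sk = rng.standard_normal((n_sk, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = sketch_cross_attention(img, sk, Wq, Wk, Wv)
print(out.shape)  # image token shape is preserved: (49, 16)
```

Because the output keeps the image-token shape, such a layer can be inserted after every encoder block without altering the rest of the architecture.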