When interacting with objects through cameras or pictures, users often have a specific intent; for example, they may want to perform a visual search. However, most object detection models ignore this intent and rely on image pixels as their only input. This often leads to incorrect results, such as the absence of a high-confidence detection on the object of interest, or a detection with the wrong class label. In this paper, we investigate techniques to modulate standard object detectors so that they explicitly account for user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors perform better at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be trained to solve query-modulated detection and standard object detection simultaneously.
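The abstract mentions training data synthesized from standard object detection annotations but does not spell out the procedure. As a minimal sketch of one plausible reading, the snippet below groups annotations by image and class label, so that each label present in an image becomes a query and only the boxes of that label are kept as targets. The annotation schema (`image_id`, `label`, `box`) and the function name `synthesize_query_examples` are assumptions made for illustration, not the authors' pipeline.

```python
from collections import defaultdict


def synthesize_query_examples(annotations):
    """Build (image, query label, target boxes) examples from detection annotations.

    `annotations` is a hypothetical list of dicts with the keys
    "image_id", "label", and "box" (x_min, y_min, x_max, y_max).
    """
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[(ann["image_id"], ann["label"])].append(ann["box"])

    # One example per (image, label) pair: the label acts as the query,
    # and only boxes carrying that label are kept as detection targets.
    return [
        {"image_id": image_id, "query": label, "target_boxes": boxes}
        for (image_id, label), boxes in grouped.items()
    ]


# Toy annotations for a single image (made-up values).
annotations = [
    {"image_id": "img0", "label": "dog", "box": (10, 20, 110, 220)},
    {"image_id": "img0", "label": "cat", "box": (150, 30, 260, 200)},
    {"image_id": "img0", "label": "dog", "box": (300, 40, 380, 210)},
]
examples = synthesize_query_examples(annotations)
```

Under this assumption, one annotated image with several classes yields several training examples, one per class, which is one way a standard detection dataset could be expanded into a larger query-conditioned one.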