Visual search is a ubiquitous challenge in natural vision, arising in everyday tasks such as finding a friend in a crowd or locating a car in a parking lot. Humans rely heavily on target-relevant features to perform goal-directed visual search. At the same time, context is critical for locating a target object in complex scenes, as it narrows the search area and makes the search process more efficient. However, few computational models of visual search combine both target and context information. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), which modulates self-attention in a Vision Transformer with target- and context-relevant information to achieve human-like zero-shot visual search performance. Target modulation is computed as the patch-wise local relevance between the target and the search image, whereas contextual modulation is applied globally. We evaluate TCT against competitive visual search models on three natural-scene datasets of varying difficulty. TCT exhibits human-like search efficiency and outperforms state-of-the-art models on challenging visual search tasks. Importantly, TCT generalizes across datasets with novel objects without retraining or fine-tuning. Finally, we introduce a new dataset for benchmarking invariant visual search under incongruent contexts; through its combined target and context modulation, TCT searches flexibly even when the context is incongruent.
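To make the modulation scheme concrete, the following is a minimal sketch, not the authors' implementation: it biases standard scaled dot-product attention over search-image patches with (a) a patch-wise target-relevance map and (b) a global contextual prior. The cosine-similarity measure, the max-over-target-patches reduction, and the additive combination of the two signals are all illustrative assumptions; the tensor shapes and function name are hypothetical.

```python
# Illustrative sketch only (assumed mechanics, not the TCT paper's code).
import torch
import torch.nn.functional as F

def target_context_attention(q, k, v, target_feats, search_feats, context_prior):
    """
    q, k, v:        (B, N, D) query/key/value projections of search-image patches
    target_feats:   (B, M, D) patch embeddings of the target image
    search_feats:   (B, N, D) patch embeddings of the search image
    context_prior:  (B, N)    global contextual relevance per search patch
    """
    # Patch-wise local target relevance: for each search patch, take the
    # maximum cosine similarity to any target patch (assumed reduction).
    sim = F.cosine_similarity(
        search_feats.unsqueeze(2),   # (B, N, 1, D)
        target_feats.unsqueeze(1),   # (B, 1, M, D)
        dim=-1,
    )                                # (B, N, M)
    target_mod = sim.max(dim=-1).values  # (B, N)

    # Standard scaled dot-product attention logits over search patches.
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5  # (B, N, N)

    # Bias each key column by local target relevance plus the global
    # contextual prior (additive combination is an assumption).
    logits = logits + target_mod.unsqueeze(1) + context_prior.unsqueeze(1)

    attn = logits.softmax(dim=-1)
    return attn @ v                  # (B, N, D)
```

Under this reading, the target signal acts locally (per search patch, via similarity to the target), while the context signal acts globally (a scene-level prior over where the target is likely to appear), matching the local/global distinction stated above.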