Imagine trying to track one particular fruit fly in a swarm of hundreds. Higher biological visual systems have evolved to track moving objects by relying on both appearance and motion features. We investigate whether state-of-the-art deep neural networks for visual tracking are capable of the same. For this, we introduce PathTracker, a synthetic visual challenge that asks human observers and machines to track a target object in the midst of identical-looking "distractor" objects. While humans effortlessly learn PathTracker and generalize to systematic variations in task design, state-of-the-art deep networks struggle. To address this limitation, we identify and model circuit mechanisms in biological brains that are implicated in tracking objects based on motion cues. When instantiated as a recurrent network, our circuit model learns to solve PathTracker with a robust visual strategy that rivals human performance and explains a significant proportion of human decision-making on the challenge. We also show that the success of this circuit model extends to object tracking in natural videos. Adding it to a transformer-based architecture for object tracking builds tolerance to visual nuisances that affect object appearance, resulting in new state-of-the-art performance on the large-scale TrackingNet object tracking challenge. Our work highlights the importance of building artificial vision models that can help us better understand human vision and improve computer vision.