Human vision possesses a special type of visual processing system called peripheral vision. By partitioning the entire visual field into multiple contour regions based on their distance from the center of our gaze, peripheral vision gives us the ability to perceive different visual features in different regions. In this work, we take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition. We propose incorporating peripheral position encoding into the multi-head self-attention layers so that the network learns, from the training data, to partition the visual field into diverse peripheral regions. We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception, showing that the network learns to perceive visual data in a way similar to human vision. State-of-the-art performance on the image classification task across various model sizes demonstrates the efficacy of the proposed method.
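To make the idea of position-dependent attention concrete, the following is a minimal sketch and not the PerViT formulation itself. It assumes a single attention head, fixed concentric rings defined by Euclidean distance between patch positions, and an additive per-ring bias on the attention logits. The names `relative_distances`, `ring_index`, `attention_with_position_bias`, `ring_width`, and `ring_bias` are hypothetical; in the proposed network the regions are learned from data rather than hard-coded as they are here.

```python
import math


def relative_distances(height, width):
    """Euclidean distance between every pair of patch positions on a grid."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [math.hypot(qr - kr, qc - kc) for (kr, kc) in coords]
        for (qr, qc) in coords
    ]


def ring_index(distance, ring_width=1.0, num_rings=3):
    """Map a distance to a ring: 0 is the central region, larger is more peripheral.

    Fixed-width rings are an illustrative assumption, not a learned partition.
    """
    return min(int(distance / ring_width), num_rings - 1)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_position_bias(queries, keys, values, dist, ring_bias,
                                 ring_width=1.0):
    """Single-head attention whose logits receive one bias value per ring.

    ring_bias stands in for parameters that would be learned in training
    (hypothetical; one scalar per ring).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        logits = []
        for j, key in enumerate(keys):
            content = scale * sum(a * b for a, b in zip(query, key))
            ring = ring_index(dist[i][j], ring_width, len(ring_bias))
            logits.append(content + ring_bias[ring])
        weights = softmax(logits)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(value_dim)
        ])
    return outputs
```

With all biases set to zero, the sketch reduces to plain content-based attention. Making the biases for the outer rings strongly negative restricts each query to its central region, which loosely mimics a purely local receptive field.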