Human vision possesses a special type of visual processing system called peripheral vision. By partitioning the entire visual field into multiple contour regions based on the distance to the center of our gaze, peripheral vision gives us the ability to perceive different visual features in different regions. In this work, we take a biologically inspired approach and explore modeling peripheral vision in deep neural networks for visual recognition. We propose incorporating peripheral position encoding into the multi-head self-attention layers so that the network learns, from training data, to partition the visual field into diverse peripheral regions. We evaluate the proposed network, dubbed PerViT, on ImageNet-1K and systematically investigate the inner workings of the model for machine perception, showing that the network learns to perceive visual data in a way similar to human vision. The performance improvements in image classification over the baselines across different model sizes demonstrate the efficacy of the proposed method.
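To make the idea of position-dependent attention concrete, the following is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the PerViT formulation: the additive combination of content and position scores, the hand-set Gaussian ring bias, and the names `ring_bias`, `ring_radius`, `ring_width`, and `position_biased_attention` are all hypothetical. In the proposed network the position encoding is learned from training data rather than fixed by hand.

```python
import math


def grid_positions(height, width):
    """Return (row, col) coordinates for every patch of a height x width grid."""
    return [(r, c) for r in range(height) for c in range(width)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ring_bias(distance, ring_radius, ring_width):
    """Hypothetical position score: highest when a key lies `ring_radius`
    away from the query, falling off like a Gaussian of width `ring_width`."""
    return -((distance - ring_radius) ** 2) / (2.0 * ring_width ** 2)


def position_biased_attention(queries, keys, values, positions,
                              ring_radius, ring_width):
    """Single-head attention whose logits are content score + position score.

    Assumption: the two scores are simply added; this is a simplification
    chosen for illustration, not the combination used by PerViT.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for qi, q in enumerate(queries):
        logits = []
        for ki, k in enumerate(keys):
            content = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            dr = positions[qi][0] - positions[ki][0]
            dc = positions[qi][1] - positions[ki][1]
            distance = math.hypot(dr, dc)  # Euclidean distance on the grid
            logits.append(content + ring_bias(distance, ring_radius, ring_width))
        row = softmax(logits)
        weights.append(row)
        # Weighted sum of value vectors for this query.
        outputs.append([
            sum(row[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


if __name__ == "__main__":
    pos = grid_positions(3, 3)
    flat = [[0.0, 0.0] for _ in pos]        # zero content, so only position matters
    vals = [[float(i)] for i in range(len(pos))]
    _, w = position_biased_attention(flat, flat, vals, pos,
                                     ring_radius=1.0, ring_width=0.5)
    print([round(x, 3) for x in w[4]])      # attention of the center patch
```

With `ring_radius=0.0` the head attends mostly to the query's own location, loosely analogous to central vision; larger radii shift attention to a surrounding ring, loosely analogous to a peripheral region. Giving each head its own radius and width is one simple way to picture how different heads could cover different regions of the visual field.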