Convolutional neural networks have enabled major progress on pixel-level prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction, thanks to their powerful capabilities in visual representation learning. State-of-the-art models typically integrate attention mechanisms to improve deep feature representations. Recently, some works have demonstrated the significance of learning and combining both spatial and channel-wise attention for deep feature refinement. In this paper, we aim to improve on these previous approaches and propose a unified deep framework that jointly learns spatial attention maps and channel attention vectors in a principled manner, so as to structure the resulting attention tensors and model the interactions between these two types of attention. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework, leading to VarIational STructured Attention networks (VISTA-Net). We implement the inference rules within the neural network, thus allowing for end-to-end learning of both the probabilistic parameters and the CNN front-end parameters. As demonstrated by our extensive empirical evaluation on six large-scale datasets for dense visual prediction, VISTA-Net outperforms the state of the art in multiple continuous and discrete prediction tasks, confirming the benefit of the proposed approach to joint structured spatial-channel attention estimation for deep representation learning. The code is available at https://github.com/ygjwd12345/VISTA-Net.
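To make the idea of a structured attention tensor concrete, the following is a minimal sketch, assuming the tensor is formed as the outer product of a channel attention vector and a spatial attention map. It is not the VISTA-Net implementation: the pooling-plus-sigmoid estimators below stand in for the variational inference described above, and the function name `structured_attention` and the tensor shapes are hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def structured_attention(features):
    """Apply a rank-1 structured attention tensor to a feature map.

    features: array of shape (C, H, W).
    Returns (refined_features, attention), both of shape (C, H, W).

    Assumption: both attentions are simple deterministic functions of
    pooled features, not the probabilistic estimates used in the paper.
    """
    # Channel attention vector: one weight per channel,
    # computed from global average pooling over H and W.
    channel_vec = sigmoid(features.mean(axis=(1, 2)))  # shape (C,)

    # Spatial attention map: one weight per pixel,
    # computed from the mean over channels.
    spatial_map = sigmoid(features.mean(axis=0))  # shape (H, W)

    # Structured attention tensor: outer product of the vector and the map,
    # so every channel slice is a rescaled copy of the same spatial map.
    attention = channel_vec[:, None, None] * spatial_map[None, :, :]

    return attention * features, attention


# Toy usage on a random feature map with 4 channels and 8x8 resolution.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 8))
refined, attn = structured_attention(feats)
```

The outer-product form means the full C×H×W attention tensor is described by only C + H·W values, which is one simple way to couple the two attention types rather than estimating them independently.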