Multi-scale representations learned with deep convolutional neural networks have proven highly important for various pixel-level prediction problems. In this paper we present a novel approach that advances the state of the art in pixel-level prediction in a fundamental aspect, i.e. structured multi-scale feature learning and fusion. In contrast to previous works, which directly take multi-scale feature maps from the inner layers of a primary CNN architecture and simply fuse them by weighted averaging or concatenation, we propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner. To further improve the learning capacity of the network structure, we propose to exploit feature-dependent conditional kernels within the deep probabilistic framework. Extensive experiments are conducted on four publicly available datasets (i.e. BSDS500, NYUD-V2, KITTI, and Pascal-Context) and on three challenging pixel-wise prediction problems involving both discrete and continuous labels (i.e. monocular depth estimation, object contour prediction, and semantic segmentation). Quantitative and qualitative results demonstrate the effectiveness of the proposed latent AG-CRF model and of the overall probabilistic graph attention network with feature-conditional kernels for structured feature learning and pixel-wise prediction.
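To make the contrast drawn in the abstract concrete, the following minimal sketch compares the two simple fusion baselines (weighted averaging and concatenation) with a generic attention-gated fusion, in which a per-pixel gate in (0, 1) controls how much of one scale's features is passed to another. This is an illustration only and does not reproduce the AG-CRF model: the function names, tensor shapes, and the use of fixed random gate logits are all assumptions made for this example, whereas the proposed model learns and infers its gates within a probabilistic framework.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping gate logits to values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def fuse_weighted_average(feature_maps, weights):
    """Baseline fusion: weighted mean of same-shaped (C, H, W) feature maps."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(feature_maps)  # shape (S, C, H, W)
    # Contract the scale axis against the normalized weights.
    return np.tensordot(weights / weights.sum(), stacked, axes=1)


def fuse_concatenate(feature_maps):
    """Baseline fusion: stack the maps along the channel axis."""
    return np.concatenate(feature_maps, axis=0)


def fuse_attention_gated(target, source, gate_logits):
    """Illustrative gated fusion (hypothetical, not the AG-CRF update).

    target, source: (C, H, W) feature maps at two scales, already resized
    to a common resolution. gate_logits: (H, W) per-pixel scores.
    """
    gate = sigmoid(gate_logits)  # (H, W), each value in (0, 1)
    # The gate is broadcast over channels and scales the incoming message.
    return target + gate[None, :, :] * source


# Toy inputs: two scales with 4 channels on an 8x8 grid (assumed shapes).
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 8, 8))
fine = rng.normal(size=(4, 8, 8))
gate_logits = rng.normal(size=(8, 8))

averaged = fuse_weighted_average([coarse, fine], [0.5, 0.5])
concatenated = fuse_concatenate([coarse, fine])
gated = fuse_attention_gated(fine, coarse, gate_logits)
```

Unlike the two baselines, the gated variant can suppress the message from another scale at individual pixels: where the gate is close to zero the target features pass through unchanged, and where it is close to one the source features are added in full.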