Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2$^{nd}$ order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with attention mechanisms for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2$^{nd}$ order interactions between the input single-modal or multi-modal features. Higher and even infinite order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with an Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split. When Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted to 132.8%. Source code is available at \url{https://github.com/Panda-Peter/image-captioning}.
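As a rough illustration of the block summarized above, the following NumPy sketch combines a query with keys and values through element-wise (low-rank bilinear) products and then derives a spatial softmax distribution and a channel-wise sigmoid gate from the result. It is not the released implementation: every name (`bilinear_attention`, `init_params`, `W_k`, `W_qk`, and so on), the dimensions, the ReLU activations, the mean pooling over regions, and the random initialization are assumptions made for illustration only.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng, d_q, d_k, d_v, d_b, d_c):
    # Hypothetical parameter set; shapes are illustrative assumptions.
    s = 0.1
    return {
        "W_k": rng.normal(0, s, (d_k, d_b)),   # key projection
        "W_qk": rng.normal(0, s, (d_q, d_b)),  # query projection (key side)
        "W_v": rng.normal(0, s, (d_v, d_b)),   # value projection
        "W_qv": rng.normal(0, s, (d_q, d_b)),  # query projection (value side)
        "W_b": rng.normal(0, s, (d_b, d_c)),   # embedding of the joint feature
        "w_s": rng.normal(0, s, (d_c,)),       # spatial attention scorer
        "W_e": rng.normal(0, s, (d_c, d_b)),   # channel attention excitation
    }


def bilinear_attention(q, K, V, p):
    """q: (d_q,), K: (N, d_k), V: (N, d_v). Returns (attended, beta_s, beta_c)."""
    # 2nd-order query-key and query-value interactions via element-wise products.
    B_k = relu(K @ p["W_k"]) * relu(q @ p["W_qk"])  # (N, d_b)
    B_v = relu(V @ p["W_v"]) * relu(q @ p["W_qv"])  # (N, d_b)

    H = relu(B_k @ p["W_b"])                        # (N, d_c)

    # Spatial distribution: one weight per region, summing to 1.
    beta_s = softmax(H @ p["w_s"])                  # (N,)

    # Channel-wise gate: pool over regions, then one gate per channel.
    beta_c = sigmoid(H.mean(axis=0) @ p["W_e"])     # (d_b,)

    # Spatially weighted sum of the value features, gated per channel.
    attended = beta_c * (beta_s @ B_v)              # (d_b,)
    return attended, beta_s, beta_c


rng = np.random.default_rng(0)
params = init_params(rng, d_q=8, d_k=16, d_v=16, d_b=12, d_c=6)
query = rng.normal(size=8)
regions = rng.normal(size=(5, 16))  # e.g. 5 image region features
attended, beta_s, beta_c = bilinear_attention(query, regions, regions, params)
print(attended.shape, beta_s.shape, beta_c.shape)
```

Under these assumptions, stacking would amount to feeding `attended` (after a suitable projection back to the query dimension) in as the query of a further block, which is one way to read the abstract's remark about higher order interactions.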