Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To work around this problem, co-attention builds two separate attention distributions, one for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representation for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
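To make the mechanism concrete, here is a minimal PyTorch sketch of a single bilinear attention glimpse followed by bilinear attention pooling, in the low-rank formulation the abstract describes. The parameter names (`U`, `V`, `p`, `U2`, `V2`), the ReLU activation, the rank `K`, and the toy dimensions are illustrative assumptions, not the paper's exact configuration; the softmax here normalizes over all word-object pairs.

```python
# A minimal sketch of one bilinear attention glimpse (hypothetical names/shapes).
import torch
import torch.nn.functional as F

def bilinear_attention(X, Y, U, V, p):
    """X: (N, d_x) question token features; Y: (M, d_y) visual object features.
    U: (d_x, K) and V: (d_y, K) are low-rank projections; p: (K,) pools the rank
    dimension into a scalar logit. Returns an (N, M) attention map over pairs."""
    Xh = torch.relu(X @ U)                 # (N, K)
    Yh = torch.relu(Y @ V)                 # (M, K)
    logits = (Xh * p) @ Yh.T               # low-rank bilinear logits, (N, M)
    # Normalize over the flattened map: one distribution over all (word, object) pairs.
    return F.softmax(logits.view(-1), dim=0).view_as(logits)

def bilinear_attention_pooling(X, Y, A, U2, V2):
    """Joint representation f with f_k = relu(X U2)[:, k]^T @ A @ relu(Y V2)[:, k]."""
    Xh = torch.relu(X @ U2)                # (N, K2)
    Yh = torch.relu(Y @ V2)                # (M, K2)
    return torch.einsum('nk,nm,mk->k', Xh, A, Yh)  # (K2,)

# Example usage with toy dimensions (all values hypothetical):
X = torch.randn(14, 1280)                  # 14 question tokens
Y = torch.randn(36, 2048)                  # 36 detected objects
U, V, p = torch.randn(1280, 512), torch.randn(2048, 512), torch.randn(512)
U2, V2 = torch.randn(1280, 512), torch.randn(2048, 512)
A = bilinear_attention(X, Y, U, V, p)      # (14, 36) attention map
f = bilinear_attention_pooling(X, Y, A, U2, V2)  # joint feature, (512,)
```

Normalizing over the flattened map makes the attention a single distribution over every pair of input channels, which is what distinguishes bilinear attention from the two separate per-modality distributions used in co-attention.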