The non-local (NL) block is a popular module that demonstrates the capability to model global context. However, the NL block generally incurs heavy computation and memory costs, so applying it to high-resolution feature maps is impractical. In this paper, to investigate the efficacy of the NL block, we empirically analyze whether the magnitude and direction of input feature vectors properly affect the attention between vectors. The results reveal the inefficacy of the softmax operation, which is commonly used to normalize the attention map of the NL block. Attention maps normalized with the softmax operation rely heavily on the magnitude of the key vectors, and performance degrades when this magnitude information is removed. By replacing the softmax operation with a scaling factor, we demonstrate improved performance on CIFAR-10, CIFAR-100, and Tiny-ImageNet. In addition, our method shows robustness to embedding channel reduction and embedding weight initialization. Notably, our method makes multi-head attention employable without additional computational cost.
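To make the described change concrete, the sketch below shows a simplified non-local block in which the softmax normalization of the attention map is replaced by a scalar scaling factor. This is a minimal illustration, not the authors' implementation: the class name, layer shapes, and the particular choice of 1/N (the number of spatial positions) as the scaling factor are assumptions for exposition.

import torch
import torch.nn as nn

class ScaledNLBlock(nn.Module):
    """Illustrative non-local block: attention normalized by a scaling
    factor (here 1/N) instead of the usual softmax. Names and shapes
    are hypothetical, not taken from the paper."""

    def __init__(self, in_channels, embed_channels=None):
        super().__init__()
        embed_channels = embed_channels or in_channels // 2
        self.query = nn.Conv2d(in_channels, embed_channels, kernel_size=1)  # theta
        self.key = nn.Conv2d(in_channels, embed_channels, kernel_size=1)    # phi
        self.value = nn.Conv2d(in_channels, embed_channels, kernel_size=1)  # g
        self.out = nn.Conv2d(embed_channels, in_channels, kernel_size=1)    # W_z

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w  # number of spatial positions
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, n, c')
        k = self.key(x).flatten(2)                      # (b, c', n)
        v = self.value(x).flatten(2).transpose(1, 2)    # (b, n, c')
        attn = torch.bmm(q, k)                          # (b, n, n) raw similarities
        attn = attn / n   # scaling-factor normalization in place of softmax
        y = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)  # residual connection as in the standard NL block

Because the scaling is a fixed scalar rather than a row-wise softmax, the attention weights no longer depend on the magnitude of the key vectors, which is the property the abstract attributes to the proposed replacement.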