A sequential visual task usually requires a model to attend to the object of current interest, conditioned on its previous observations. Unlike the popular soft attention mechanism, we propose a new attention framework built on a novel conditional global feature, which serves as a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields produce attention maps by measuring how well their convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task, such as a simple recurrent neural network for multiple-object recognition or a moderately complex language model for image captioning. Experiments show that the proposed conditional attention model achieves the best performance on the SVHN (Street View House Numbers) dataset both with and without extra bounding boxes; for image captioning, it achieves better scores than the popular soft attention model.
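The abstract does not state how alignment between the convolutional features and the conditional global feature is measured. As a minimal sketch only, the example below assumes a dot-product alignment score followed by a softmax over spatial locations; the function names `attention_map` and `attend`, the toy 2x2 feature map, and the values of `global_feature` are all hypothetical and are not taken from the paper.

```python
import math


def attention_map(conv_features, global_feature):
    """Return one attention weight per spatial location.

    conv_features: list of N location vectors, each of length D
                   (a flattened H x W feature map with D channels).
    global_feature: length-D vector standing in for the conditional
                    global feature (assumed to come from a recurrent model).
    Assumption: alignment is a dot product, normalized with a softmax.
    """
    scores = [
        sum(f * g for f, g in zip(feat, global_feature))
        for feat in conv_features
    ]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(conv_features, weights):
    """Weighted sum of location vectors under the attention weights."""
    dim = len(conv_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, conv_features))
        for d in range(dim)
    ]


# Toy 2x2 feature map with 2 channels, flattened to 4 locations.
conv_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical conditional global feature that favors channel 0.
global_feature = [2.0, 0.0]

weights = attention_map(conv_features, global_feature)
context = attend(conv_features, weights)
```

In this toy case the locations whose features point in the same direction as `global_feature` receive the largest weights. Under the framework described above, the same computation would be repeated at several convolutional layers with different receptive fields, and `global_feature` would be updated by the recurrent structure at each step.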