A visual hard attention model actively selects and observes a sequence of subregions in an image to make a prediction. Most hard attention models determine the attention-worthy regions by first analyzing the complete image. However, the entire image may not be available initially and may instead be sensed gradually through a series of partial observations. In this paper, we design an efficient hard attention model for classifying such sequentially observed scenes. The presented model never observes an image completely. To select informative regions under partial observability, the model uses Bayesian Optimal Experiment Design. First, it synthesizes the features of the unobserved regions based on the already observed regions. Then, it uses the predicted features to estimate the expected information gain (EIG) that would be attained by attending to each candidate region. Finally, the model attends to the actual content at the location where the EIG is maximal. The model uses (a) a recurrent feature aggregator to maintain a recurrent state, (b) a linear classifier to predict the class label, and (c) a Partial variational autoencoder (Partial VAE) to predict the features of unobserved regions. We use normalizing flows in the Partial VAE to handle multi-modality in the feature-synthesis problem. We train our model using a differentiable objective and test it on five datasets. Our model attains 2-10% higher accuracy than the baseline models when both have seen only a couple of glimpses.
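The selection step described above (score each candidate region by its EIG, then attend to the maximum) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the array layout, and the Monte Carlo estimator used here (entropy of the averaged class prediction minus the average entropy of the individual predictions) are assumptions made for illustration. The sketch also assumes that class probabilities have already been computed from features sampled for each unobserved location.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy (in nats) of categorical distributions along `axis`."""
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=axis)


def expected_information_gain(sampled_probs):
    """Monte Carlo EIG estimate per candidate location (illustrative assumption).

    sampled_probs has shape (L, S, C): for each of L candidate locations,
    S class-probability vectors over C classes, one per synthesized feature
    sample. The estimate is H[mean_s p_s] - mean_s H[p_s].
    """
    sampled_probs = np.asarray(sampled_probs, dtype=float)
    marginal = sampled_probs.mean(axis=1)                # shape (L, C)
    expected_posterior_entropy = entropy(sampled_probs).mean(axis=1)  # shape (L,)
    return entropy(marginal) - expected_posterior_entropy


def select_glimpse(sampled_probs):
    """Index of the candidate location with the largest estimated EIG."""
    return int(np.argmax(expected_information_gain(sampled_probs)))


# Hypothetical toy input: 2 candidate locations, 2 feature samples, 2 classes.
toy_probs = np.array([
    [[0.50, 0.50], [0.50, 0.50]],  # samples agree and are uncertain
    [[0.99, 0.01], [0.01, 0.99]],  # samples are confident but disagree
])
next_location = select_glimpse(toy_probs)
```

In this toy input the second location scores higher, because observing it is expected to resolve the class uncertainty, whereas every synthesized feature for the first location leaves the prediction unchanged.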