Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition

Following the success of transformers in a range of visual tasks, the spatial self-attention mechanism has attracted increasing attention in the computer vision community. However, we found empirically that a typical vision transformer with spatial self-attention could not learn accurate attention maps for distinguishing different categories of fine-grained images. To address this problem, and motivated by the temporal attention mechanism in the brain, we propose a spatial-temporal attention network for learning fine-grained feature representations, called STAN, in which the features learnt by a sequence of spatial self-attention operations corresponding to multiple moments are aggregated progressively. STAN consists of four modules: (i) a self-attention backbone module that learns a sequence of features with self-attention operations; (ii) a spatial feature self-organizing module that facilitates model training; (iii) a spatial-temporal feature learning module that aggregates the re-organized features via a Long Short-Term Memory (LSTM) network; and (iv) a context-aware module, implemented as the forget block of the spatial-temporal feature learning module, that preserves or forgets long-term memory by utilizing contextual information. We then propose a STAN-based method for open-set fine-grained recognition, called STAN-OSFGR, which integrates STAN with a linear classifier. Extensive experiments on three fine-grained datasets and two coarse-grained datasets show that STAN-OSFGR significantly outperforms nine state-of-the-art open-set recognition methods in most cases.
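
The pipeline described above (a sequence of features aggregated progressively through gated memory, then passed to a linear classifier) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The gates here are a simplified elementwise stand-in for an LSTM cell-state update, and the rejection rule (thresholding the maximum softmax probability) is a common open-set baseline assumed for illustration only, since the abstract does not state how unknown classes are rejected. All function names, weights, and the threshold value are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(features, w_f=1.0, w_i=1.0):
    """Fold a sequence of feature vectors into one memory vector.

    Simplified LSTM-style cell update with elementwise gates (assumption):
        c_t = f_t * c_{t-1} + i_t * tanh(x_t)
        f_t = sigmoid(w_f * x_t),  i_t = sigmoid(w_i * x_t)
    A full LSTM would use learned weight matrices and the hidden state.
    """
    memory = [0.0] * len(features[0])
    for x in features:  # one step per "moment" in the feature sequence
        forget = [sigmoid(w_f * v) for v in x]
        write = [sigmoid(w_i * v) for v in x]
        memory = [
            f * c + i * math.tanh(v)
            for f, c, i, v in zip(forget, memory, write, x)
        ]
    return memory


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(feature, weights, biases, threshold=0.5):
    """Linear classifier with max-softmax rejection (assumed baseline rule).

    Returns the index of the predicted known class, or -1 for "unknown"
    when the highest class probability falls below the threshold.
    """
    logits = [
        sum(w * v for w, v in zip(row, feature)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else -1
```

In this sketch, `aggregate` would receive one feature vector per self-attention step, and its output would be passed to `classify_open_set`. A sample whose aggregated feature yields no confident class is labelled `-1`, i.e. treated as belonging to an unknown category.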
