Contextual information plays an important role in action recognition. Local operations have difficulty modeling the relation between two elements separated by a long distance. However, directly modeling the contextual information between any two points incurs huge computation and memory costs, especially for action recognition, where there is an additional temporal dimension. Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal contextual information in video for action recognition. The global context is factorized into sparse relation maps. At each step, we model the relationships between points on the same line along the horizontal, vertical, and depth directions, which forms a 3D criss-cross structure, and we repeat the same operation with a recurrent mechanism so that the relations propagate from a line to a plane and finally to the whole spatiotemporal space. Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 30%, respectively, for video context modeling. We evaluate the performance of RCCA-3D with two recent action recognition networks on three datasets and conduct a thorough analysis of the architecture, obtaining the optimal way to factorize and fuse the relation maps. Comparisons with other state-of-the-art methods demonstrate the effectiveness and efficiency of our model.
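To make the criss-cross idea concrete, the following is a minimal PyTorch sketch of one 3D criss-cross attention step, written from the description above rather than from the paper's released code: every voxel attends only to the T + H + W voxels sharing one of its three axis-aligned lines. The module name, the channel-reduction ratio, and the handling of the triple-counted self position are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention3D(nn.Module):
    """One criss-cross step: each voxel (t, h, w) attends to voxels on its
    temporal, vertical, and horizontal lines (a sketch, not the paper's code)."""

    def __init__(self, channels, reduction=8):  # reduction ratio is assumed
        super().__init__()
        self.query = nn.Conv3d(channels, channels // reduction, 1)
        self.key = nn.Conv3d(channels, channels // reduction, 1)
        self.value = nn.Conv3d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, t, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Affinities along each axis; i/j/k index target positions on a line.
        e_t = torch.einsum('bcthw,bcihw->bthwi', q, k)  # (b, t, h, w, T)
        e_h = torch.einsum('bcthw,bctjw->bthwj', q, k)  # (b, t, h, w, H)
        e_w = torch.einsum('bcthw,bcthk->bthwk', q, k)  # (b, t, h, w, W)

        # One softmax over the whole criss-cross set of T + H + W affinities.
        # Note: the self position appears once per line here; the paper may
        # resolve that overlap differently.
        attn = F.softmax(torch.cat([e_t, e_h, e_w], dim=-1), dim=-1)
        a_t, a_h, a_w = attn.split([t, h, w], dim=-1)

        # Aggregate values gathered from the three lines.
        out = (torch.einsum('bthwi,bcihw->bcthw', a_t, v)
               + torch.einsum('bthwj,bctjw->bcthw', a_h, v)
               + torch.einsum('bthwk,bcthk->bcthw', a_w, v))
        return self.gamma * out + x

# Usage: applying the module recurrently propagates context from a line to a
# plane and then to the full spatiotemporal volume; two steps shown here.
cca = CrissCrossAttention3D(64)
feat = torch.randn(2, 64, 8, 14, 14)  # (batch, channels, T, H, W)
for _ in range(2):
    feat = cca(feat)
```

The recurrence is what makes the sparse factorization cover the dense global context: after the first pass a voxel has seen its three lines, and after the second pass it has indirectly seen every voxel reachable through them.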