We propose ST-DETR, a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames. We treat the frames as sequences in both space and time and apply full attention mechanisms to exploit feature correlations across both dimensions. This treatment lets us handle a frame sequence as traces of temporal object features at every spatial location. We explore two approaches: early aggregation of spatial features over the temporal dimension, and late temporal aggregation of object-query spatial features. Moreover, we propose a novel Temporal Positional Embedding technique to encode the time-sequence information. To evaluate our approach, we choose the Moving Object Detection (MOD) task, since it is a perfect candidate to showcase the importance of the temporal dimension. Results show a significant 5% mAP improvement on the KITTI MOD dataset over the 1-step spatial baseline.
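To make the idea of attending jointly over space and time concrete, the following is a minimal NumPy sketch, not the ST-DETR implementation. It assumes per-frame feature maps already flattened to shape (T, S, D), uses a standard sinusoidal encoding as a stand-in for the proposed Temporal Positional Embedding (whose exact form is not given here), and omits learned query/key/value projections, multiple heads, and the DETR decoder. The function names `sinusoidal_embedding` and `spatio_temporal_self_attention` are hypothetical.

```python
import numpy as np


def sinusoidal_embedding(num_positions, dim):
    """Standard sinusoidal position encoding; a stand-in, not the paper's embedding."""
    positions = np.arange(num_positions)[:, None]                    # (P, 1)
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions * freqs[None, :]                              # (P, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatio_temporal_self_attention(features):
    """Single-head self-attention over all (frame, location) tokens.

    features: array of shape (T, S, D) with T frames, S spatial
    locations per frame, and D channels (D assumed even).
    """
    T, S, D = features.shape
    temporal = sinusoidal_embedding(T, D)[:, None, :]  # (T, 1, D), shared across locations
    spatial = sinusoidal_embedding(S, D)[None, :, :]   # (1, S, D), shared across frames
    # Flatten space and time into one token sequence so every token
    # can attend to every location in every frame.
    tokens = (features + temporal + spatial).reshape(T * S, D)
    scores = tokens @ tokens.T / np.sqrt(D)            # (T*S, T*S)
    weights = softmax(scores, axis=-1)
    attended = weights @ tokens                        # (T*S, D)
    return attended.reshape(T, S, D), weights


rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 4, 8))  # 3 frames, 4 locations, 8 channels
output, attention = spatio_temporal_self_attention(frames)
print(output.shape, attention.shape)
```

Flattening frames and locations into a single sequence corresponds loosely to the early-aggregation variant; a late-aggregation variant would instead run spatial attention per frame and combine the resulting object-query features across time.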