Self-supervised feature learning enables perception systems to benefit from the vast amounts of raw data recorded by vehicle fleets all over the world. However, the potential of such methods to learn dense representations from sequential data remains relatively unexplored. In this work, we propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks. We embed each frame as an unordered set of proposal feature vectors, a representation that is natural for instance-level perception architectures, and formulate the sequential ordering prediction by comparing similarities between these sets of feature vectors in a transformer-based multi-frame architecture. Extensive evaluations in the automated driving domain on the BDD100K and MOT17 datasets show that our TempO approach outperforms existing self-supervised single-frame pre-training methods as well as supervised transfer learning initialization strategies on standard object detection and multi-object tracking benchmarks.
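To make the described setup concrete, the sketch below illustrates one way a TempO-style ordering head could be wired up: each frame is represented by an unordered set of proposal feature vectors, a transformer encoder mixes information across frames, and pairwise frame similarities are produced for scoring candidate temporal orderings. This is a minimal illustration under our own assumptions, not the authors' implementation; all module and argument names (`TempOOrderingHead`, `feat_dim`, `num_proposals`, etc.) are hypothetical.

```python
# Hedged sketch of a TempO-style temporal ordering head (illustrative only).
import torch
import torch.nn as nn


class TempOOrderingHead(nn.Module):
    def __init__(self, feat_dim: int = 256, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, proposals: torch.Tensor) -> torch.Tensor:
        """proposals: (B, T, N, D) = batch, frames, proposals per frame, feature dim.

        Returns a (B, T, T) matrix of pairwise frame similarities that a
        downstream loss could use to score the correct temporal ordering.
        """
        b, t, n, d = proposals.shape
        # Flatten frames and proposals into one token sequence; no positional
        # encoding is applied over N, so each frame's proposals stay unordered.
        tokens = self.encoder(proposals.reshape(b, t * n, d))
        # Pool each frame's proposal tokens into a single frame embedding.
        frame_emb = tokens.reshape(b, t, n, d).mean(dim=2)
        # Dot-product similarities between all pairs of frame embeddings.
        return torch.einsum("btd,bsd->bts", frame_emb, frame_emb)


if __name__ == "__main__":
    head = TempOOrderingHead()
    feats = torch.randn(2, 4, 10, 256)  # 2 clips, 4 frames, 10 proposals each
    sims = head(feats)
    print(sims.shape)  # torch.Size([2, 4, 4])
```

In such a setup, the similarity matrix would feed a ranking or classification loss over frame permutations, which is one plausible reading of the sequential ordering prediction described above.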