Autonomous driving has attracted much attention over the years but has turned out to be harder than expected, probably because labeled data for model training is difficult to collect. Self-supervised learning (SSL), which learns representations from unlabeled data only, is a promising way to improve model performance. Existing SSL methods, however, usually rely on the single-centric-object guarantee, which may not hold for multi-instance datasets such as street scenes. To alleviate this limitation, we raise two issues to solve: (1) how to define positive samples for cross-view consistency and (2) how to measure similarity in multi-instance circumstances. We first adopt an IoU threshold during random cropping to transfer global inconsistency to local consistency. Then, we propose two feature alignment methods that enable 2D feature maps to be used for multi-instance similarity measurement. Additionally, we adopt intra-image clustering with self-attention to further mine intra-image similarity and translation invariance. Experiments show that, when pre-trained on the Waymo dataset, our method, called Multi-instance Siamese Network (MultiSiam), remarkably improves generalization ability and achieves state-of-the-art transfer performance on autonomous driving benchmarks, including Cityscapes and BDD100K, while existing SSL counterparts like MoCo, MoCo-v2, and BYOL show a significant performance drop. By pre-training on SODA10M, a large-scale autonomous driving dataset, MultiSiam exceeds the ImageNet pre-trained MoCo-v2, demonstrating the potential of domain-specific pre-training. Code will be available at https://github.com/KaiChen1998/MultiSiam.
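The IoU-thresholded random cropping mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold of 0.5, the crop-scale range, the fixed aspect ratio, and the fallback to identical crops are all assumptions made for illustration. The idea shown is only that two random crops are accepted as a positive pair when they overlap enough to plausibly share local content.

```python
import random


def random_crop(width, height, scale=(0.2, 1.0), rng=random):
    """Sample a crop box (x0, y0, x1, y1) inside a width x height image.

    Assumption of this sketch: the crop keeps the image aspect ratio and
    covers a random fraction of the image area drawn from `scale`.
    """
    side = rng.uniform(*scale) ** 0.5
    w, h = width * side, height * side
    x0 = rng.uniform(0, width - w)
    y0 = rng.uniform(0, height - h)
    return (x0, y0, x0 + w, y0 + h)


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_view_pair(width, height, iou_threshold=0.5, max_tries=1000, rng=random):
    """Rejection-sample two crops whose IoU is at least `iou_threshold`.

    The threshold value is hypothetical; the abstract does not state it.
    """
    for _ in range(max_tries):
        a = random_crop(width, height, rng=rng)
        b = random_crop(width, height, rng=rng)
        if iou(a, b) >= iou_threshold:
            return a, b
    # Fallback (an assumption): identical crops trivially meet any threshold.
    a = random_crop(width, height, rng=rng)
    return a, a
```

A call such as `sample_view_pair(1920, 1280, iou_threshold=0.5)` returns two boxes that could then be cropped and augmented independently to form the two views. Rejection sampling is the simplest option here; sampling the second crop conditionally on the first would be cheaper for high thresholds.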