This paper proposes a pre-trained neural network for event camera data. Our model is a self-supervised learning framework trained on paired event camera data and natural RGB images. Our method contains three modules connected in sequence: i) a family of event data augmentations that generates meaningful event images for self-supervised training; ii) a conditional masking strategy that samples informative event patches from event images, encouraging our model to capture the spatial layout of a scene and accelerating training; iii) a contrastive learning approach that enforces embedding similarity between matching event images, and between paired event and RGB images. An embedding projection loss is proposed to avoid model collapse when enforcing the event image embedding similarities. A probability distribution alignment loss is proposed to encourage the event image embedding to be consistent with that of its paired RGB image in the feature space. Transfer learning performance on downstream tasks demonstrates the superiority of our method over state-of-the-art approaches. For example, we achieve a top-1 accuracy of 64.83% on the N-ImageNet dataset.
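The abstract does not specify the exact form of the contrastive objective, so the following is only a generic InfoNCE-style sketch of how embedding similarity between paired event and RGB images might be enforced; the function name, embedding shapes, and temperature value are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def info_nce_loss(event_emb, rgb_emb, temperature=0.1):
    """Generic InfoNCE-style contrastive loss (illustrative sketch).

    event_emb, rgb_emb: (N, D) arrays where row i of each is a matched pair.
    Matched pairs act as positives; all other rows in the batch are negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    r = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    logits = (e @ r.T) / temperature          # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-likelihood
    return -np.mean(np.diag(log_prob))
```

Under this sketch, the loss is small when each event embedding is closest to its own paired RGB embedding and large when pairings are scrambled, which is the behavior the abstract's similarity-enforcement module relies on.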