This paper proposes a pre-trained neural network for handling event camera data. Our model is trained in a self-supervised learning framework using paired event camera data and natural RGB images. Our method comprises three modules connected in sequence: i) a family of event data augmentations that generates meaningful event images for self-supervised training; ii) a conditional masking strategy that samples informative event patches from event images, encouraging our model to capture the spatial layout of a scene and accelerating training; iii) a contrastive learning approach that enforces the similarity of embeddings between matching event images and between paired event-RGB images. An embedding projection loss is proposed to avoid model collapse when enforcing event embedding similarities. A probability distribution alignment loss is proposed to encourage the event data to be consistent with its paired RGB image in feature space. Transfer experiments on downstream tasks demonstrate the superiority of our method over state-of-the-art approaches; for example, we achieve a top-1 accuracy of 64.83\% on the N-ImageNet dataset.
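To make the paired objectives concrete, the following is a minimal sketch, not the authors' released code, of a symmetric InfoNCE-style contrastive term together with one plausible reading of the probability distribution alignment loss. The function names, the temperature value, and the KL-based form of the alignment term are all illustrative assumptions.

```python
# Minimal sketch of the two training objectives named in the abstract.
# All names (info_nce, distribution_alignment) and the temperature value
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def info_nce(z_event, z_rgb, temperature=0.07):
    """Symmetric InfoNCE: pull paired event/RGB embeddings together,
    push apart non-matching pairs within the batch."""
    z_event = F.normalize(z_event, dim=-1)
    z_rgb = F.normalize(z_rgb, dim=-1)
    logits = z_event @ z_rgb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_event.size(0), device=z_event.device)
    # Matching pairs lie on the diagonal; off-diagonal entries are negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def distribution_alignment(z_event, z_rgb, temperature=0.07):
    """One plausible form of the probability distribution alignment loss:
    match the batch-wise similarity distribution of event embeddings to
    that of their paired RGB embeddings via a KL divergence."""
    z_event = F.normalize(z_event, dim=-1)
    z_rgb = F.normalize(z_rgb, dim=-1)
    log_p_event = F.log_softmax(z_event @ z_event.t() / temperature, dim=-1)
    p_rgb = F.softmax(z_rgb @ z_rgb.t() / temperature, dim=-1)
    return F.kl_div(log_p_event, p_rgb, reduction='batchmean')

# Example usage with random embeddings standing in for encoder outputs.
z_e, z_r = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce(z_e, z_r) + distribution_alignment(z_e, z_r)
```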