Event cameras asynchronously capture brightness changes with low latency, high temporal resolution, and high dynamic range. Deploying deep learning methods for classification or other tasks on these sensors typically requires large labeled datasets. However, the amount of labeled event data is small compared to the wealth of labeled RGB imagery, which has limited the progress of event-based vision. To reduce this dependency on labeled event data, we introduce Masked Event Modeling (MEM), a self-supervised pretraining framework for events. Our method pretrains a neural network on unlabeled events, which can originate from any event camera recording. Subsequently, the pretrained model is finetuned on a downstream task, leading to better overall performance while requiring fewer labels. Our method outperforms the state of the art on N-ImageNet, N-Cars, and N-Caltech101, increasing object classification accuracy on N-ImageNet by 7.96%. We further demonstrate that Masked Event Modeling is superior to RGB-based pretraining on a real-world dataset.
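The abstract does not specify MEM's architecture or masking scheme, so the following is only a minimal PyTorch sketch of the general masked-modeling idea applied to event data: accumulate raw events into a two-channel polarity histogram, randomly mask a large fraction of patch tokens, and train a small transformer encoder to reconstruct the masked patches. All names and hyperparameters here (events_to_histogram, PATCH, MASK_RATIO, the tiny encoder) are illustrative assumptions, not the paper's actual design.

```python
# Sketch of masked-modeling pretraining on events; all settings hypothetical.
import torch
import torch.nn as nn

PATCH, DIM, MASK_RATIO = 16, 192, 0.75  # illustrative hyperparameters

def events_to_histogram(events, H=128, W=128):
    """Accumulate (x, y, t, p) events into a 2-channel (neg/pos) histogram."""
    hist = torch.zeros(2, H, W)
    x, y, p = events[:, 0].long(), events[:, 1].long(), events[:, 3].long()
    hist.index_put_((p, y, x), torch.ones(len(events)), accumulate=True)
    return hist

class MaskedEventModel(nn.Module):
    def __init__(self, H=128, W=128):
        super().__init__()
        self.n_patches = (H // PATCH) * (W // PATCH)
        self.embed = nn.Conv2d(2, DIM, kernel_size=PATCH, stride=PATCH)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, PATCH * PATCH * 2)  # reconstruct raw patches

    def forward(self, hist):
        tokens = self.embed(hist).flatten(2).transpose(1, 2) + self.pos
        # Randomly replace a fraction of patch tokens with the mask token.
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < MASK_RATIO
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        return self.head(self.encoder(tokens)), mask

# One pretraining step on synthetic events: reconstruct only masked patches.
model = MaskedEventModel()
events = torch.rand(5000, 4) * torch.tensor([127.0, 127.0, 1.0, 1.999])
hist = events_to_histogram(events).unsqueeze(0)
target = hist.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH) \
             .permute(0, 2, 3, 1, 4, 5).reshape(1, -1, PATCH * PATCH * 2)
pred, mask = model(hist)
loss = ((pred - target) ** 2).mean(-1)[mask].mean()
loss.backward()
```

After such pretraining, the reconstruction head would be discarded and the encoder finetuned with a classification head on the labeled downstream dataset, which is how the label requirement is reduced.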