Event cameras are neuromorphic vision sensors that represent visual information as sparse, asynchronous event streams. Most state-of-the-art event-based methods project events into dense frames and process them with conventional learning models. However, these approaches sacrifice the sparsity and high temporal resolution of event data, resulting in large model sizes and high computational complexity. To suit the sparse nature of events and fully exploit their implicit relationships, we develop a novel attention-aware framework named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams. It first converts the event stream into a voxel set and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that extracts discriminative spatiotemporal features through two well-designed components: a multi-scale neighbor embedding layer (MNEL) for local information aggregation and a voxel self-attention layer (VSAL) for global representation modeling. To enable the framework to incorporate long-term temporal structure, we introduce a segmental consensus strategy that models motion patterns over a sequence of segmented voxel sets. We evaluate the proposed framework on two event-based tasks: object classification and action recognition. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity. Additionally, we present a new dataset (NeuroHAR), recorded in challenging visual scenarios, to address the lack of real-world event-based datasets for action recognition.
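The event-to-voxel-set conversion mentioned above can be illustrated with a minimal NumPy sketch: events are quantized into spatiotemporal voxels, and only non-empty voxels are kept, preserving sparsity. The voxel size and per-voxel features (event count and mean polarity) here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def events_to_voxel_set(events, voxel_size=(8, 8, 5000)):
    """Quantize an (N, 4) event array [x, y, t, polarity] into a sparse
    voxel set: unique non-empty voxel coordinates plus a simple per-voxel
    feature vector (event count, mean polarity).

    NOTE: voxel_size and the feature choice are hypothetical placeholders
    for illustration only.
    """
    vx, vy, vt = voxel_size
    # Integer voxel coordinates for each event.
    coords = np.stack([events[:, 0] // vx,
                       events[:, 1] // vy,
                       events[:, 2] // vt], axis=1).astype(np.int64)
    # Keep only the non-empty voxels (the sparse set) and map each
    # event back to its voxel via the inverse index.
    uniq, inv = np.unique(coords, axis=0, return_inverse=True)
    counts = np.bincount(inv)
    mean_pol = np.bincount(inv, weights=events[:, 3]) / counts
    feats = np.stack([counts.astype(np.float64), mean_pol], axis=1)
    return uniq, feats

# Example: three events, two landing in the same voxel.
events = np.array([[0, 0, 0, 1.0],
                   [1, 1, 100, -1.0],
                   [10, 10, 6000, 1.0]])
voxels, feats = events_to_voxel_set(events)
```

Each voxel in the resulting set would then be embedded and fed to the MNEL/VSAL encoder stack described in the abstract.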