Mainstream human activity recognition (HAR) algorithms are developed for RGB cameras, which suffer from illumination changes, fast motion, privacy concerns, and high energy consumption. Meanwhile, biologically inspired event cameras have attracted great interest due to their unique features, such as high dynamic range, dense temporal but sparse spatial resolution, low latency, and low power. However, as the event camera is a newly emerging sensor, no realistic large-scale dataset yet exists for HAR. Considering its great practical value, in this paper we propose a large-scale benchmark dataset, termed HARDVS, to bridge this gap; it contains 300 categories and more than 100K event sequences. We evaluate and report the performance of multiple popular HAR algorithms, providing extensive baselines against which future works can compare. More importantly, we propose a novel spatial-temporal feature learning and fusion framework, termed ESTF, for event-stream-based human activity recognition. It first projects the event streams into spatial and temporal embeddings using StemNet, then encodes and fuses the dual-view representations using Transformer networks. Finally, the dual features are concatenated and fed into a classification head for activity prediction. Extensive experiments on multiple datasets fully validate the effectiveness of our model. Both the dataset and source code will be released at \url{https://github.com/Event-AHU/HARDVS}.
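The dual-branch flow described above (stem projections of the two views, Transformer-style cross-view fusion, concatenation, classification head) can be illustrated with a minimal NumPy mock-up. All shapes, layer sizes, and helper names here (`stem`, `attention`, the linear head) are illustrative assumptions for the sketch, not the authors' actual ESTF implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # embedding dimension (assumed for the sketch)
T, HW = 8, 32   # temporal steps, flattened spatial positions
C = 300         # HARDVS has 300 activity classes

def stem(x, w):
    """Linear stand-in for StemNet: project raw features to D-dim tokens."""
    return x @ w

def attention(q, k, v):
    """Single-head scaled dot-product attention, used here to fuse views."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy event representation: a spatial view (HW tokens) and a temporal view
# (T tokens), each with 24 raw features per token (assumed).
spatial_raw = rng.standard_normal((HW, 24))
temporal_raw = rng.standard_normal((T, 24))
w_s, w_t = rng.standard_normal((24, D)), rng.standard_normal((24, D))

spat = stem(spatial_raw, w_s)   # spatial embeddings, shape (HW, D)
temp = stem(temporal_raw, w_t)  # temporal embeddings, shape (T, D)

# Cross-view fusion: each branch attends to the other branch's tokens.
spat_fused = attention(spat, temp, temp)
temp_fused = attention(temp, spat, spat)

# Pool each branch, concatenate the dual features, apply a linear head.
feat = np.concatenate([spat_fused.mean(axis=0), temp_fused.mean(axis=0)])
w_head = rng.standard_normal((2 * D, C))
logits = feat @ w_head          # shape (C,)
pred = int(np.argmax(logits))   # predicted activity index in [0, C)
```

The point of the sketch is only the data flow: two independently embedded views, mutual attention for fusion, and a concatenated feature vector scored over the 300 classes.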