Adaptive sampling that exploits the spatiotemporal redundancy in videos is critical for always-on action recognition on wearable devices with limited computing and battery resources. The commonly used fixed sampling strategy is not context-aware and may under-sample the visual content, adversely impacting both computational efficiency and accuracy. Inspired by the concepts of foveal vision and pre-attentive processing from the human visual perception mechanism, we introduce a novel adaptive spatiotemporal sampling scheme for efficient action recognition. Our system pre-scans the global scene context at low resolution and decides whether to skip or to request high-resolution features at salient regions for further processing. We validate the system on the EPIC-KITCHENS and UCF-101 datasets for action recognition, and show that our proposed approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines. Source code is available at https://github.com/knmac/adaptive_spatiotemporal.
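To make the two-stage idea concrete, below is a minimal PyTorch-style sketch of the pre-attentive/foveal loop described above: a cheap pass over a downsampled frame produces a saliency map, and only when saliency is high enough is a full-resolution crop of the salient region processed. All module names (`low_res_net`, `saliency_head`, `high_res_net`, `classifier`) and the threshold-and-crop gating policy are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_forward(frames, low_res_net, saliency_head, high_res_net,
                     classifier, threshold=0.5, low_size=112, crop=64):
    """Sketch of adaptive spatiotemporal sampling.

    frames: (T, C, H, W) video clip at full resolution.
    All networks are hypothetical stand-ins for the paper's modules.
    """
    feats = []
    for frame in frames:
        # Pre-attentive pass: cheap features from a downsampled frame.
        low = F.interpolate(frame.unsqueeze(0), size=(low_size, low_size),
                            mode='bilinear', align_corners=False)
        low_feat = low_res_net(low)        # (1, D) global low-res feature
        sal = saliency_head(low)           # (1, 1, h, w) saliency map
        if sal.amax() < threshold:
            # Skip: the low-resolution feature suffices for this frame.
            feats.append(low_feat)
        else:
            # Foveal pass: crop the most salient region at full resolution.
            y, x = divmod(sal.flatten().argmax().item(), sal.shape[-1])
            cy = int((y + 0.5) / sal.shape[-2] * frame.shape[-2])
            cx = int((x + 0.5) / sal.shape[-1] * frame.shape[-1])
            top = max(0, min(cy - crop // 2, frame.shape[-2] - crop))
            left = max(0, min(cx - crop // 2, frame.shape[-1] - crop))
            patch = frame[:, top:top + crop, left:left + crop].unsqueeze(0)
            # Fuse cheap global context with the high-resolution detail.
            feats.append(low_feat + high_res_net(patch))
    # Temporal aggregation over per-frame features, then classification.
    return classifier(torch.stack(feats, dim=1))  # (1, T, D) -> logits
```

Under this sketch, compute scales with scene dynamics: static or uninformative frames cost only the low-resolution pass, while the expensive high-resolution branch runs only on salient regions.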