Pixel-space augmentation has grown popular across many deep learning areas due to its effectiveness, simplicity, and low computational cost. Data augmentation for videos, however, remains an under-explored research topic, as most works treat inputs as stacks of static images rather than temporally linked sequences of data. Recently, it has been shown that involving the time dimension when designing augmentations can be superior to spatial-only variants for video action recognition. In this paper, we propose several novel enhancements to these techniques that strengthen the relationship between the spatial and temporal domains and achieve a deeper level of perturbation. On the UCF-101 and HMDB-51 datasets, our techniques outperform their respective variants in both Top-1 and Top-5 video action recognition accuracy.
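To illustrate the general idea of involving the time dimension in an augmentation, the sketch below implements a temporally varying random crop: the crop window is sampled independently at the first and last frames and linearly interpolated across the clip, so the perturbation evolves smoothly over time instead of being applied identically to every frame. This is a minimal, hypothetical example of the class of techniques discussed, not the paper's exact method; the function name and interface are our own.

```python
import numpy as np

def temporally_varying_crop(clip, out_size, rng=None):
    """Crop a video clip with a window that drifts over time.

    clip:     array of shape (T, H, W, C)
    out_size: (crop_height, crop_width)
    Returns an array of shape (T, crop_height, crop_width, C).

    Illustrative sketch only: crop corners are sampled at the first
    and last frames and linearly interpolated for the frames between.
    """
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, c = clip.shape
    ch, cw = out_size
    # Sample top-left corners for the first and last frames.
    y_start, y_end = rng.integers(0, h - ch + 1, size=2)
    x_start, x_end = rng.integers(0, w - cw + 1, size=2)
    out = np.empty((t, ch, cw, c), dtype=clip.dtype)
    for i in range(t):
        a = i / max(t - 1, 1)  # interpolation weight in [0, 1]
        y = int(round((1 - a) * y_start + a * y_end))
        x = int(round((1 - a) * x_start + a * x_end))
        out[i] = clip[i, y:y + ch, x:x + cw]
    return out
```

Because consecutive frames receive nearly identical crop windows, the augmented clip remains a temporally coherent sequence, which is the key property spatial-only (per-frame) augmentation does not guarantee.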