Data augmentation has recently emerged as an essential component of modern training recipes for visual recognition tasks. Despite its effectiveness, however, data augmentation for video recognition has rarely been explored. The few existing augmentation recipes for video recognition naively extend image augmentation methods by applying the same operations to every frame of a video. Our main idea is that the magnitude of the augmentation operations applied to each frame should change over time to capture the temporal variations of real-world videos. These variations should be as diverse as possible while introducing few additional hyper-parameters during training. Motivated by this, we propose a simple yet effective video data augmentation framework, DynaAugment. The magnitude of the augmentation operations on each frame is varied by an effective mechanism, Fourier Sampling, which parameterizes diverse, smooth, and realistic temporal variations. DynaAugment also includes an extended search space, suitable for video, for automatic data augmentation methods. Experiments with DynaAugment demonstrate that there is additional room for improvement over static augmentations across diverse video models. Specifically, we show the effectiveness of DynaAugment on various video datasets and tasks: large-scale video recognition (Kinetics-400 and Something-Something-v2), small-scale video recognition (UCF-101 and HMDB-51), fine-grained video recognition (Diving-48 and FineGym), video action segmentation on Breakfast, video action localization on THUMOS'14, and video object detection on MOT17Det. DynaAugment also enables video models to learn more generalized representations, improving model robustness on corrupted videos.
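
As a rough illustration of the idea of time-varying magnitudes, the following minimal sketch draws a smooth per-frame magnitude curve from a random weighted sum of sinusoids. It is not the authors' implementation: the function name `sample_frame_magnitudes`, the parameters `amplitude`, `num_basis`, and `max_freq`, and their default values are all assumptions made for this example, and the abstract does not specify how Fourier Sampling is actually parameterized.

```python
import math
import random


def sample_frame_magnitudes(num_frames, base_magnitude, amplitude=0.3,
                            num_basis=3, max_freq=1.5, seed=None):
    """Return one augmentation magnitude per frame as a smooth random curve.

    Hypothetical sketch: the basis count, frequency range, and amplitude
    are illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)

    # Random non-negative weights normalized to sum to 1, so the mixed
    # signal below always stays within [-1, 1].
    raw = [rng.random() + 1e-8 for _ in range(num_basis)]
    total = sum(raw)
    weights = [r / total for r in raw]

    # Random frequency (cycles per clip) and phase for each sinusoid.
    freqs = [rng.uniform(0.2, max_freq) for _ in range(num_basis)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_basis)]

    magnitudes = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)  # normalized time in [0, 1]
        signal = sum(
            w * math.sin(2.0 * math.pi * f * t + p)
            for w, f, p in zip(weights, freqs, phases)
        )
        # Perturb the static magnitude by at most +/- amplitude (relative).
        magnitudes.append(base_magnitude * (1.0 + amplitude * signal))
    return magnitudes


if __name__ == "__main__":
    # A 16-frame clip with a static magnitude of 10 for some operation.
    for m in sample_frame_magnitudes(16, 10.0, seed=0):
        print(f"{m:.2f}")
```

Each returned value would be passed as the strength of a single augmentation operation (for example, a rotation angle or brightness factor) on the corresponding frame. Setting `amplitude=0` in this sketch gives the same magnitude for every frame, which corresponds to the static augmentation that the abstract contrasts against.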