Existing video recognition algorithms typically adopt separate training pipelines for inputs with different frame counts, which requires repetitive training runs and multiplies storage costs. If we evaluate a model at frame counts not used during training, we observe that its performance drops significantly (see Fig. 1); we summarize this as the Temporal Frequency Deviation phenomenon. To address this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables a model to be evaluated at different frame counts to adjust its computation, but also significantly reduces the memory cost of storing multiple models. Concretely, FFN integrates several sets of training sequences, employs Multi-Frequency Alignment (MFAL) to learn temporal frequency invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen representation ability. Comprehensive empirical validation with various architectures and popular benchmarks demonstrates the effectiveness and generalization of FFN (e.g., gains of 7.08/5.15/2.17% at Frame 4/8/16 on the Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN.
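To make the multi-frequency training idea concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the same clip is subsampled to several frame counts, a single shared model processes all of them, and low-frame predictions are aligned with the highest-frame prediction. The toy backbone, the frame counts, the KL-based alignment loss, and the equal loss weighting are all illustrative assumptions, not the exact FFN implementation (see the linked repository for the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoNet(nn.Module):
    """Toy frame-count-agnostic recognizer: per-frame features + temporal pooling."""

    def __init__(self, num_classes: int = 174):  # 174 classes in Something-Something V1
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.flatten(0, 1))   # (b*t, 16)
        pooled = feats.view(b, t, -1).mean(dim=1)         # temporal average pooling
        return self.classifier(pooled)


def multi_frequency_step(model, video_16, labels, frame_counts=(4, 8, 16)):
    """One training step over several temporal frequencies of the same clip.

    Uniformly subsamples the 16-frame clip to each frame count, shares the model
    across all of them, and pulls low-frame predictions toward the highest-frame
    prediction via KL divergence -- a hypothetical loss form standing in for
    "temporal frequency invariant representations".
    """
    logits = {}
    for t in frame_counts:
        idx = torch.linspace(0, video_16.shape[1] - 1, t).long()  # uniform sampling
        logits[t] = model(video_16[:, idx])

    teacher_t = max(frame_counts)
    ce = sum(F.cross_entropy(logits[t], labels) for t in frame_counts)
    align = sum(
        F.kl_div(F.log_softmax(logits[t], dim=1),
                 F.softmax(logits[teacher_t], dim=1).detach(),
                 reduction="batchmean")
        for t in frame_counts if t != teacher_t
    )
    return ce + align  # equal weighting is an illustrative choice


# Usage: a single step on random data.
model = TinyVideoNet()
video = torch.randn(2, 16, 3, 32, 32)   # batch of 2 clips, 16 frames each
labels = torch.randint(0, 174, (2,))
loss = multi_frequency_step(model, video, labels)
loss.backward()
```

Because the model weights are shared across all frame counts during training, the same network can then be evaluated at 4, 8, or 16 frames to trade accuracy for computation, without storing one model per frame count.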