We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g., by including filters in the temporal dimension to turn them into 3D networks, or by using optical flow), which increases computational cost. Instead, we propose a novel sampling strategy in which we re-order the channels of the input video to capture short-term frame-to-frame changes. We observe that, without bells and whistles, the proposed sampling strategy improves performance across multiple architectures (e.g., TSN, TRN, and TSM) and datasets (CATER, Something-Something-V1 and V2) by up to 24% over the baseline that uses the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training or testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available at https://github.com/kiyoon/PyVideoAI.
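As a rough illustration of the idea, the sketch below re-orders channels so that each output frame draws its R, G, and B channels from three neighbouring input frames, letting a single frame encode short-term motion. This is a minimal sketch only: the function name, the specific (t, t+1, t+2) assignment, and the boundary handling are assumptions made here for illustration; the actual sampling schemes are defined in the released code.

```python
import torch

def channel_reorder_sampling(video: torch.Tensor) -> torch.Tensor:
    """Illustrative channel re-ordering for a clip of shape (T, C, H, W).

    Each output frame t takes its three channels from frames t, t+1, t+2,
    so frame-to-frame changes appear within a single frame. Hypothetical
    instantiation of the abstract's idea, not the paper's exact scheme.
    """
    T, C, H, W = video.shape
    assert C == 3, "expects an RGB clip"
    out = torch.empty_like(video)
    for t in range(T):
        # Clamp indices near the end of the clip so they stay valid.
        t1 = min(t + 1, T - 1)
        t2 = min(t + 2, T - 1)
        out[t, 0] = video[t, 0]   # R from frame t
        out[t, 1] = video[t1, 1]  # G from frame t+1
        out[t, 2] = video[t2, 2]  # B from frame t+2
    return out

# Usage: a random 8-frame RGB clip at 224x224 resolution.
clip = torch.rand(8, 3, 224, 224)
reordered = channel_reorder_sampling(clip)  # same shape as the input
```

Because the operation only permutes existing channels, the input shape seen by the network is unchanged, which is consistent with the claim that the strategy adds no computational cost at training or test time.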