We present a new algorithm for selecting informative frames in video action recognition. Our approach is designed for aerial videos captured with a moving camera, in which human actors occupy only a small portion of each video frame. Our algorithm exploits the motion bias within aerial videos, which enables the selection of motion-salient frames. We introduce the patch mutual information (PMI) score, which quantifies the motion bias between adjacent frames by measuring the similarity of their patches. We use this score to assess the amount of discriminative motion information contained in one frame relative to another. We present an adaptive frame selection strategy based on a shifted leaky ReLU and a cumulative distribution function, which ensures that the sampled frames comprehensively cover all the essential segments with high motion salience. Our approach can be integrated with any action recognition model to enhance its accuracy. In practice, our method achieves relative improvements in top-1 accuracy of 2.2% to 13.8% on UAV-Human, 6.8% on NEC Drone, and 9.0% on Diving48.
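The abstract does not give the exact formulation of the PMI score or of the selection rule, so the following is only a minimal sketch of how such a pipeline could look, not the authors' implementation. It assumes grayscale `uint8` frames, estimates mutual information from a joint intensity histogram of co-located patches, treats low PMI (low similarity) as high motion, shifts the leaky ReLU by the mean motion score, and picks frames at evenly spaced quantiles of the resulting cumulative distribution. The function names, the patch size, the bin count, the slope, and the choice of shift are all hypothetical.

```python
import numpy as np


def patch_mutual_information(frame_a, frame_b, patch=16, bins=16):
    """Mean mutual information (in nats) between co-located patches.

    Assumption: frames are 2-D grayscale arrays with values in [0, 255]
    and are at least `patch` pixels wide and high.
    """
    h, w = frame_a.shape
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            pa = frame_a[y:y + patch, x:x + patch].ravel()
            pb = frame_b[y:y + patch, x:x + patch].ravel()
            # Joint intensity histogram of the two patches.
            joint, _, _ = np.histogram2d(
                pa, pb, bins=bins, range=[[0, 256], [0, 256]]
            )
            pxy = joint / joint.sum()
            px = pxy.sum(axis=1, keepdims=True)  # marginal of patch a
            py = pxy.sum(axis=0, keepdims=True)  # marginal of patch b
            nz = pxy > 0  # skip empty cells to avoid log(0)
            scores.append(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
    return float(np.mean(scores))


def shifted_leaky_relu(x, shift, slope=0.1):
    """Leaky ReLU applied to (x - shift)."""
    z = x - shift
    return np.where(z > 0, z, slope * z)


def select_frames(pmi_scores, num_frames, slope=0.1):
    """Pick `num_frames` indices, denser where adjacent-frame PMI is low.

    pmi_scores[t] is the PMI between frame t and frame t + 1.
    """
    pmi = np.asarray(pmi_scores, dtype=float)
    # Assumption: low patch similarity indicates high motion salience.
    motion = pmi.max() - pmi
    # Assumption: shift by the mean so below-average motion is damped.
    weights = shifted_leaky_relu(motion, shift=motion.mean(), slope=slope)
    # Make the weights strictly positive so they define a valid CDF.
    weights = weights - weights.min() + 1e-8
    cdf = np.cumsum(weights) / np.sum(weights)
    # Evenly spaced quantiles mapped through the inverse CDF.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    idx = np.searchsorted(cdf, quantiles)
    return np.minimum(idx, len(pmi) - 1)
```

In this sketch, sampling at evenly spaced quantiles of the CDF concentrates the selected indices in segments where the motion weights are large, while the leaky slope keeps low-motion segments from being assigned negative weight. The selected indices could then be passed to any downstream action recognition model in place of uniformly sampled frames.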