Search-Map-Search: A Frame Selection Paradigm for Action Recognition

Despite the success of deep learning in video understanding tasks, processing every frame in a video is computationally expensive and often unnecessary in real-time applications. Frame selection aims to extract the most informative and representative frames to help a model better understand video content. Existing frame selection methods either individually sample frames based on per-frame importance prediction, without considering interaction among frames, or adopt reinforcement learning agents to find representative frames in succession, which are costly to train and may lead to potential stability issues. To overcome the limitations of existing methods, we propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead. Specifically, we first propose a hierarchical search method conducted on each training video to search for the optimal combination of frames with the lowest error on the downstream task. A feature mapping function is then learned to map the frames of a video to the representation of its target optimal frame combination. During inference, another search is performed on an unseen video to select a combination of frames whose feature representation is close to the projected feature representation. Extensive experiments based on several action recognition benchmarks demonstrate that our frame selection method effectively improves performance of action recognition models, and significantly outperforms a number of competitive baselines.
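
The three stages described in the abstract (search on training videos, learn a feature mapping, search again at inference) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: frames are represented by random feature vectors, a frame combination is represented by the mean of its frame features, the hierarchical search is replaced by a plain greedy coordinate search, the mapping function is a least-squares linear map, and `task_loss` is a made-up stand-in for the downstream recognition error. The helper names (`combo_feature`, `local_search`, `fit_mapping`, `select_frames`) are hypothetical.

```python
import numpy as np


def combo_feature(frame_feats, idx):
    """Represent a frame combination by the mean of its frame features (assumption)."""
    return frame_feats[list(idx)].mean(axis=0)


def local_search(n_frames, k, cost, n_sweeps=3):
    """Greedy coordinate search over k distinct frame indices (requires k <= n_frames).

    A simple stand-in for the hierarchical search named in the abstract:
    each slot is swapped for any unused frame that lowers `cost`.
    """
    idx = [int(i) for i in np.linspace(0, n_frames - 1, k).round()]  # uniform start
    best = cost(idx)
    for _ in range(n_sweeps):
        improved = False
        for slot in range(k):
            for cand in range(n_frames):
                if cand in idx:
                    continue  # keep the selected frames distinct
                trial = idx.copy()
                trial[slot] = cand
                c = cost(trial)
                if c < best:
                    idx, best, improved = trial, c, True
        if not improved:
            break
    return sorted(idx), best


def fit_mapping(video_feats, target_feats):
    """Least-squares linear map W with video_feats @ W ~ target_feats (assumption)."""
    W, *_ = np.linalg.lstsq(video_feats, target_feats, rcond=None)
    return W


def select_frames(frame_feats, W, k):
    """Inference: search for the combination closest to the mapped target feature."""
    target = frame_feats.mean(axis=0) @ W
    cost = lambda idx: float(np.linalg.norm(combo_feature(frame_feats, idx) - target))
    return local_search(len(frame_feats), k, cost)[0]


# --- toy data: 30 training videos, 20 frames each, 8-dim frame features ---
rng = np.random.default_rng(0)
dim, n_frames, k = 8, 20, 4
prototype = rng.normal(size=dim)  # toy target used by the made-up task loss
train_videos = [rng.normal(size=(n_frames, dim)) for _ in range(30)]


def task_loss(feat):
    """Made-up downstream loss: squared distance to the toy prototype."""
    return float(np.sum((feat - prototype) ** 2))


# Stage 1 (search): best combination per training video under the task loss.
targets = []
for v in train_videos:
    best_idx, _ = local_search(n_frames, k, lambda i: task_loss(combo_feature(v, i)))
    targets.append(combo_feature(v, best_idx))

# Stage 2 (map): learn video feature -> feature of its best combination.
X = np.stack([v.mean(axis=0) for v in train_videos])
W = fit_mapping(X, np.stack(targets))

# Stage 3 (search): select frames for an unseen video without using any label.
selected = select_frames(rng.normal(size=(n_frames, dim)), W, k)
print(selected)
```

The point of the sketch is the division of labor: the expensive, label-dependent search runs only on training videos, while inference needs just one forward mapping plus a cheap search in feature space.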
