This technical report presents the third-place solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims to localize the temporal boundary of a step in an untrimmed video based on a textual description. The biggest challenge of this task lies in the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are insufficient to provide fine-grained cues. To address this issue, we propose to achieve fine-grained representations by exploiting feature diversity. Specifically, we propose a series of methods spanning feature extraction, network optimization, and model ensembling. As a result, we achieved 3rd place in the MTVG competition.