Language-driven action localization in videos is a challenging task that involves not only visual-linguistic matching but also action boundary prediction. Recent progress has been achieved by aligning language queries to video segments, but estimating precise boundaries remains under-explored. In this paper, we propose entity-aware and motion-aware Transformers that progressively localize actions in videos by first coarsely locating clips with entity queries and then finely predicting exact boundaries within a shrunken temporal region with motion queries. The entity-aware Transformer incorporates textual entities into visual representation learning via cross-modal and cross-frame attention to facilitate attending to action-related video clips. The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales by integrating long short-term memory into the self-attention module, further improving the precision of action boundary prediction. Extensive experiments on the Charades-STA and TACoS datasets demonstrate that our method achieves better performance than existing methods.
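The abstract does not spell out the motion-aware formulation, so the following is only a minimal illustrative sketch of the general idea: folding an LSTM into a self-attention block and applying it at several temporal scales. The class name MotionAwareSelfAttention, the additive fusion of the two paths, and the strided subsampling used to emulate multiple scales are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MotionAwareSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention augmented with an LSTM so that
    sequential motion cues are modeled alongside global clip context."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # LSTM over the clip sequence captures local, order-sensitive dynamics.
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_clips, dim) clip-level video features
        attn_out, _ = self.attn(x, x, x)   # global context via attention
        lstm_out, _ = self.lstm(x)         # fine-grained motion changes
        # Additive fusion of both paths (an assumption, for illustration).
        return self.norm(x + attn_out + lstm_out)

# Multi-scale usage: run the block at several temporal resolutions,
# here approximated by striding over the clip axis.
x = torch.randn(2, 64, 256)  # (batch, clips, dim)
block = MotionAwareSelfAttention(256)
for stride in (1, 2, 4):
    out = block(x[:, ::stride, :])
    print(stride, out.shape)
```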