Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has increased sharply. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in the given query. For example, the relevance between the text query and the video content is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Having observed that the given query plays an insignificant role in transformer architectures, we start our encoding module with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's ability to exploit the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building a query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
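
To make two of the ideas in the abstract more concrete, the following is a minimal, dependency-free sketch, not the authors' implementation. It shows (1) cross-attention in which each video clip attends over the words of the text query, and (2) one possible way to build irrelevant video-query pairs and penalize high saliency on them. The function names `cross_attend`, `make_negative_pairs`, and `negative_saliency_loss` are hypothetical. The learned query/key/value projections of a real transformer layer are omitted, and both the roll-by-one pairing and the `-log(1 - s)` penalty are assumptions chosen for illustration; the abstract does not specify either.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(clips, words):
    """Each clip vector (attention query) attends over the word vectors
    (keys and values). Learned projections are omitted in this sketch."""
    d = len(words[0])
    attended = []
    for clip in clips:
        # Scaled dot-product score between this clip and every word.
        scores = [
            sum(c * w for c, w in zip(clip, word)) / math.sqrt(d)
            for word in words
        ]
        weights = softmax(scores)
        # Weighted sum of word vectors: a text-conditioned clip feature.
        attended.append(
            [sum(wt * word[j] for wt, word in zip(weights, words)) for j in range(d)]
        )
    return attended


def make_negative_pairs(videos, queries):
    """Assumed scheme: pair each video with the query of the next sample
    in the batch. Needs a batch of at least two distinct samples."""
    rolled = queries[1:] + queries[:1]
    return list(zip(videos, rolled))


def negative_saliency_loss(scores, eps=1e-7):
    """Assumed penalty: mean of -log(1 - s) over saliency scores in [0, 1]
    predicted for irrelevant pairs, so low scores give a low loss."""
    clipped = [min(max(s, 0.0), 1.0 - eps) for s in scores]
    return sum(-math.log(1.0 - s) for s in clipped) / len(clipped)
```

In a training loop one would presumably run the model on both the original and the mismatched pairs and add the negative-pair term to the usual retrieval and saliency losses; how those terms are weighted is not stated in the abstract.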
