The recent introduction of the large-scale, long-form MAD dataset for language grounding in videos has enabled researchers to investigate how current state-of-the-art methods perform in the long-form setup, with unexpected findings. In fact, current grounding methods alone fail at this challenging task because they are unable to process long video sequences. In this work, we propose an effective way to circumvent the long-form burden by introducing a new component into grounding pipelines: a Guidance model. The purpose of the Guidance model is to efficiently remove irrelevant video segments from the search space of grounding methods by coarsely aligning the sentence with chunks of the movie and then applying legacy grounding methods only where high correlation is found. We term the discarded video segments non-describable moments. This two-stage approach proves effective at boosting the performance of several different grounding baselines on the challenging MAD dataset, achieving new state-of-the-art performance.
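The two-stage idea can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the cosine-similarity scoring, and the top-fraction pruning rule are all assumptions standing in for the learned Guidance model; `ground_fn` is a placeholder for any legacy grounding method.

```python
import numpy as np

def guided_grounding(sentence_emb, chunk_embs, ground_fn, keep_ratio=0.3):
    """Two-stage guidance sketch: score each video chunk against the
    sentence, discard low-correlation (non-describable) chunks, then
    run the legacy grounding method only on the surviving chunks.
    All names here are illustrative, not the paper's API."""
    # Stage 1: coarse sentence-chunk correlation (cosine similarity).
    sims = chunk_embs @ sentence_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(sentence_emb) + 1e-8
    )
    # Keep only the top fraction of chunks by correlation score.
    k = max(1, int(len(sims) * keep_ratio))
    keep = np.argsort(sims)[::-1][:k]
    # Stage 2: fine-grained grounding restricted to surviving chunks.
    return {int(i): ground_fn(int(i)) for i in sorted(keep)}

# Toy usage: 5 chunks; chunk 2 correlates strongly with the sentence.
sent = np.array([1.0, 0.0, 0.0])
chunks = np.array([
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.3],
    [0.9, 0.1, 0.0],   # high correlation with the sentence
    [0.0, 0.2, 1.0],
    [-0.5, 0.5, 0.0],
])
result = guided_grounding(sent, chunks,
                          ground_fn=lambda i: ("span", i),
                          keep_ratio=0.2)
```

With `keep_ratio=0.2` over five chunks, only the single most correlated chunk survives, so the expensive grounding step runs on a fraction of the movie rather than the full sequence.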