This paper explores the task of Temporal Video Grounding (TVG): given an untrimmed video and a query sentence, the goal is to localize the temporal boundaries of the action instances in the video that are described by the natural language query. Recent works address this task by directly encoding the query with large pre-trained language models (PLMs). However, isolating the effect of the improved language representations is difficult, as these works also propose improvements to the visual inputs. Furthermore, PLMs significantly increase the computational cost of training TVG models. Therefore, this paper studies the effect of PLMs on the TVG task and assesses the applicability of parameter-efficient training alternatives from NLP based on adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that TVG models can greatly benefit from PLMs when these are fine-tuned for the task, and that adapters are an effective alternative to full fine-tuning even though they are not tailored to our task. Concretely, adapters reduce the computational cost, enabling PLM integration into larger TVG models while delivering results comparable to state-of-the-art models. Finally, by benchmarking different types of adapters on TVG, our results shed light on which kind of adapter works best for each studied case.
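To make the parameter-efficient alternative concrete, below is a minimal sketch of a bottleneck adapter of the kind used in NLP (down-projection, nonlinearity, up-projection, residual connection). This is an illustrative NumPy implementation under assumed names and dimensions (`Adapter`, `hidden_dim=768`, `bottleneck_dim=64`), not the paper's exact architecture: only the small adapter matrices would be trained, while the surrounding PLM weights stay frozen.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


class Adapter:
    """Bottleneck adapter sketch: h -> h + W_up * relu(W_down * h).

    The residual connection means the module is close to the identity at
    initialization, so inserting it does not disrupt the frozen PLM.
    """

    def __init__(self, hidden_dim, bottleneck_dim, rng):
        # Small random init for the projections; biases start at zero.
        self.W_down = rng.normal(0.0, 0.02, (hidden_dim, bottleneck_dim))
        self.b_down = np.zeros(bottleneck_dim)
        self.W_up = rng.normal(0.0, 0.02, (bottleneck_dim, hidden_dim))
        self.b_up = np.zeros(hidden_dim)

    def num_params(self):
        return (self.W_down.size + self.b_down.size
                + self.W_up.size + self.b_up.size)

    def __call__(self, h):
        # h: (seq_len, hidden_dim) token representations from a frozen PLM layer.
        z = relu(h @ self.W_down + self.b_down)   # down-project to bottleneck
        return h + (z @ self.W_up + self.b_up)    # up-project and add residual


rng = np.random.default_rng(0)
adapter = Adapter(hidden_dim=768, bottleneck_dim=64, rng=rng)
h = rng.normal(size=(10, 768))   # e.g. 10 query tokens
out = adapter(h)                 # shape preserved: (10, 768)
```

With these illustrative dimensions, the adapter holds roughly 0.1M trainable parameters per insertion point, orders of magnitude fewer than fully fine-tuning a PLM such as BERT-base (~110M parameters), which is the computational saving the paper exploits.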