In this work, we develop a prompting approach for incremental summarization of task videos. As an intermediate step, we develop a sample-efficient few-shot approach for extracting semantic concepts. We leverage an existing model that extracts concepts from images, extend it to videos, and introduce a clustering and querying approach for sample efficiency, motivated by recent advances in perceiver-based architectures. Our work provides further evidence that enriching the input context with relevant entities and actions from the videos, and using these as prompts, can enhance the summaries generated by the model. We show results on a relevant dataset and discuss possible future directions for the work.
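
The clustering-then-prompting idea can be illustrated with a minimal sketch. It is not the method of this work: plain k-means stands in for the perceiver-motivated clustering and querying step, and the function names (`kmeans`, `representative_frames`, `build_prompt`), the per-frame feature vectors, and the prompt wording are all hypothetical. The sketch assumes that each frame already has a feature vector and a list of extracted concepts. It then picks one representative frame per cluster and places that frame's entities and actions into the prompt for the next incremental summary.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic, evenly spaced initialization.

    Assumption: this is a stand-in for the clustering step, not the
    perceiver-motivated approach described in the text.
    """
    step = max(1, len(points) // k)
    centers = [list(points[min(i * step, len(points) - 1)]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every frame to its nearest center.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign


def representative_frames(points, k):
    """Return, in temporal order, the index of the frame closest to each center."""
    centers, assign = kmeans(points, k)
    reps = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            reps.append(min(idx, key=lambda i: math.dist(points[i], centers[c])))
    return sorted(reps)


def build_prompt(concepts_per_frame, reps, summary_so_far=""):
    """Build a hypothetical prompt from the concepts of the representative frames."""
    seen, concepts = set(), []
    for i in reps:
        for concept in concepts_per_frame[i]:
            if concept not in seen:
                seen.add(concept)
                concepts.append(concept)
    return (
        f"Entities and actions: {', '.join(concepts)}\n"
        f"Summary so far: {summary_so_far}\n"
        "Updated summary:"
    )


# Toy data: six frames forming two visually distinct groups (made-up values).
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
concepts = [
    ["pan", "pour oil"], ["pan"], ["pan", "stove"],
    ["egg", "crack egg"], ["egg"], ["egg", "pan"],
]
reps = representative_frames(features, k=2)
prompt = build_prompt(concepts, reps, summary_so_far="The person heats a pan.")
print(prompt)
```

Querying only the representative frames keeps the number of concept-extraction calls proportional to the number of clusters rather than the number of frames, which is the sample-efficiency motivation in simplified form.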