This technical report extends our work presented in [9] with additional experiments. In [9], we tackle long-term video understanding, which requires reasoning from current and past or future observations and raises several fundamental questions. How should temporal or sequential relationships be modelled? What temporal extent of information and context needs to be processed? At what temporal scale should that information be derived? [9] addresses these questions with a flexible multi-granular temporal aggregation framework. In this report, we conduct further experiments with this framework on different tasks and on a new dataset, EPIC-KITCHENS-100.