Cognitive science has shown that humans perceive videos as sequences of events separated by the state changes of dominant subjects. State changes trigger new events and are among the most useful pieces of information within the large amount of otherwise redundant visual input. However, previous research has focused on understanding segments as a whole, without evaluating the fine-grained status changes inside them. In this paper, we introduce a new dataset called Kinetic-GEB+. The dataset consists of over 170K boundaries associated with captions describing status changes in the generic events of 12K videos. Building on this new dataset, we propose three tasks that support the development of a more fine-grained, robust, and human-like understanding of videos through status changes. We evaluate many representative baselines on our dataset, for which we also design a new TPD (Temporal-based Pairwise Difference) modeling method for representing visual differences, achieving significant performance improvements. The results further show that current methods still face formidable challenges in utilizing different granularities, representing visual differences, and accurately localizing status changes. Additional analysis shows that our dataset can drive the development of more powerful methods for understanding status changes and thus improve video-level comprehension. The dataset is available at https://github.com/Yuxuan-W/GEB-Plus
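The abstract does not detail how TPD modeling represents the visual difference around a boundary; the sketch below is only an illustration of the general idea of pairwise temporal differencing, assuming per-frame features and a symmetric sampling window, both of which are hypothetical choices rather than the authors' actual design.

```python
# Minimal sketch of pairwise temporal-difference features around a boundary.
# Frame features, window size, and the elementwise-difference formulation
# below are illustrative assumptions, not the paper's exact TPD method.
import torch

def tpd_features(frame_feats: torch.Tensor, boundary_idx: int, window: int = 4) -> torch.Tensor:
    """Compute pairwise differences between features before and after a boundary.

    frame_feats: (T, D) per-frame features, e.g. from a frozen visual backbone.
    boundary_idx: index of the annotated status-change boundary.
    window: number of frames sampled on each side of the boundary.
    Returns a (num_before * num_after, D) tensor of before/after differences.
    """
    before = frame_feats[max(0, boundary_idx - window): boundary_idx]  # (w_b, D)
    after = frame_feats[boundary_idx: boundary_idx + window]           # (w_a, D)
    # All before/after pairs: each difference captures what changed across the boundary.
    diff = after.unsqueeze(0) - before.unsqueeze(1)                    # (w_b, w_a, D)
    return diff.reshape(-1, diff.size(-1))

# Example: 32 frames with 512-d features and a boundary at frame 16.
feats = torch.randn(32, 512)
print(tpd_features(feats, boundary_idx=16).shape)  # torch.Size([16, 512])
```

Such difference features could then be fed to a downstream captioning or localization head alongside the raw before/after features; the actual architecture is described in the paper, not here.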