Human activities can be learned from video. With effective modeling, it is possible to discover not only the action labels but also the temporal structures of the activities, such as the progression of their sub-activities. Automatically recognizing such structure from the raw video signal is a new capability that promises authentic modeling and successful recognition of human-object interactions. Toward this goal, we introduce Asynchronous-Sparse Interaction Graph Networks (ASSIGN), a recurrent graph network that automatically detects the structure of interaction events associated with entities in a video scene. ASSIGN pioneers the learning of autonomous behavior of video entities, including their dynamic structures and their interactions with coexisting neighbors. In our model, the lives of entities are asynchronous with respect to one another, making them more flexible in adapting to complex scenarios. Their interactions are sparse in time, and hence more faithful to the true underlying nature and more robust in inference and learning. ASSIGN is tested on human-object interaction recognition and shows superior performance in segmenting and labeling human sub-activities and object affordances from raw videos. The model's native ability to discover temporal structures also eliminates the dependence on external segmentation that was previously mandatory for this task.
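To make the two key ideas in the abstract concrete, here is a minimal PyTorch sketch of one recurrent step in which each entity gates its own state update (asynchrony) and aggregates messages from only the neighbors whose attention weights exceed a threshold (sparse interaction in time). The module names, the soft gating scheme, and the thresholded attention below are illustrative assumptions for exposition only, not the ASSIGN architecture as published.

```python
# Conceptual sketch (not the authors' code): one step of a recurrent graph
# network with asynchronous entity updates and sparse neighbor interactions.
import torch
import torch.nn as nn

class AsyncSparseStep(nn.Module):
    def __init__(self, feat_dim, hid_dim, sparsity_threshold=0.2):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim + hid_dim, hid_dim)   # per-entity recurrence
        self.update_gate = nn.Linear(feat_dim + hid_dim, 1)   # decides *when* an entity refreshes
        self.msg = nn.Linear(hid_dim, hid_dim)                 # neighbor message transform
        self.att = nn.Bilinear(hid_dim, hid_dim, 1)            # pairwise relevance score
        self.tau = sparsity_threshold                          # assumed sparsification threshold

    def forward(self, x, h):
        # x: (N, feat_dim) per-entity observations at the current frame
        # h: (N, hid_dim)  per-entity hidden states
        n = h.size(0)
        # Sparse interaction: keep only neighbor pairs whose attention weight
        # clears the threshold; all other pairs exchange no message this step.
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        w = torch.softmax(self.att(hi, hj).squeeze(-1), dim=-1)        # (N, N)
        w = w * (w > self.tau).float() * (1.0 - torch.eye(n))          # prune weak links and self-loops
        context = w @ self.msg(h)                                      # (N, hid_dim)
        # Asynchronous update: each entity refreshes its state only to the
        # extent its gate fires; otherwise it carries the old state forward.
        inp = torch.cat([x, context], dim=-1)
        gate = torch.sigmoid(self.update_gate(inp))                    # (N, 1)
        h_new = self.cell(inp, h)
        return gate * h_new + (1.0 - gate) * h

# Usage: 4 entities (e.g., one human and three objects), 16-d frame features.
step = AsyncSparseStep(feat_dim=16, hid_dim=32)
h = torch.zeros(4, 32)
for frame_feats in torch.randn(10, 4, 16):   # 10 frames of synthetic input
    h = step(frame_feats, h)
```

In this toy version the gate is soft, so "asynchrony" appears as entities updating at different effective rates; the published model detects discrete interaction events, which this sketch does not attempt to reproduce.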