Recent action recognition models have achieved impressive results by integrating objects, their locations, and their interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how can we leverage it for a video downstream task? We propose a learning framework, StructureViT (SViT for short), which demonstrates how the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights. First, since both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos. Second, the scene representations of individual frames in a video should "align" with those of still images. This is achieved via a \emph{Frame-Clip Consistency} loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a \emph{Hand-Object Graph}, consisting of hands and objects (with their locations) as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets, and it won first place in the Ego4D CVPR'22 Object State Localization challenge. For code and pretrained models, visit the project page at \url{https://eladb3.github.io/SViT/}
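The abstract does not specify the exact form of the Frame-Clip Consistency loss. A minimal sketch of one plausible instantiation, assuming the loss aligns a frame's object tokens computed from the video clip with those computed from the same frame treated as a still image (the function name, shapes, and cosine-based formulation are illustrative assumptions, not the paper's definition):

```python
import torch
import torch.nn.functional as F

def frame_clip_consistency_loss(frame_tokens: torch.Tensor,
                                clip_tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical alignment loss between object tokens of a frame
    encoded as a still image (frame_tokens) and the same frame's object
    tokens when encoded as part of a video clip (clip_tokens).

    Assumed shapes: [batch, num_object_tokens, dim].
    Returns the mean cosine distance over tokens and batch.
    """
    frame_tokens = F.normalize(frame_tokens, dim=-1)
    clip_tokens = F.normalize(clip_tokens, dim=-1)
    # 1 - cosine similarity per token, averaged; zero when representations match
    return (1.0 - (frame_tokens * clip_tokens).sum(dim=-1)).mean()
```

Minimizing such a term would encourage the video pathway to produce frame-level scene representations consistent with the image pathway, which is the mechanism the abstract describes for transferring image-level structure to video.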