Extracting informative representations from videos is fundamental for the effective learning of various downstream tasks. Inspired by classical works on saliency, we present a novel information-theoretic approach to discover meaningful representations from videos in an unsupervised fashion. We argue that the local entropy of pixel neighborhoods, and its evolution in a video stream, is a valuable intrinsic supervisory signal for learning to attend to salient features. We thus abstract visual features into a concise representation of keypoints that act as dynamic information transporters. Thanks to two original information-theoretic losses, we discover, in an unsupervised fashion, spatio-temporally consistent keypoint representations that carry the prominent information across video frames: first, a loss that maximizes the information covered by the keypoints within a frame; second, a loss that encourages optimized keypoint transportation over time, imposing consistency of the information flow. We evaluate our keypoint-based representation against state-of-the-art baselines on different downstream tasks, such as learning object dynamics. To evaluate the expressivity and consistency of the keypoints, we propose a new set of metrics. Our empirical results showcase the superior performance of our information-driven keypoints, which address challenges such as attending to both static and dynamic objects, and to objects that abruptly enter and leave the scene.
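To make the intrinsic supervisory signal mentioned above concrete, the following is a minimal sketch of how the local entropy of pixel neighborhoods can be computed for a grayscale frame. The function name, patch size, and histogram binning are illustrative assumptions and not the paper's implementation; the point is only that each pixel is scored by the Shannon entropy of the intensity distribution in its neighborhood, and that changes of this map across frames can be tracked as a saliency cue.

```python
import numpy as np

def local_entropy(gray, patch=9, bins=32):
    """Shannon entropy of the intensity histogram in a square
    neighborhood around every pixel of a grayscale frame.

    Hypothetical helper for illustration only; patch size and
    number of histogram bins are arbitrary choices.
    """
    h, w = gray.shape
    pad = patch // 2
    padded = np.pad(gray, pad, mode="reflect")
    ent = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            # Intensity histogram of the local neighborhood.
            window = padded[i:i + patch, j:j + patch]
            hist, _ = np.histogram(window, bins=bins, range=(0, 255))
            p = hist / hist.sum()
            p = p[p > 0]
            # Shannon entropy in bits.
            ent[i, j] = -(p * np.log2(p)).sum()
    return ent

# Usage sketch: the frame-to-frame change of the entropy map gives a
# rough signal of where information appears, moves, or disappears.
frame_t0 = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
frame_t1 = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
entropy_flow = np.abs(local_entropy(frame_t1) - local_entropy(frame_t0))
```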