Devising intelligent agents able to live in an environment and learn by observing their surroundings is a longstanding goal of Artificial Intelligence. From a bare Machine Learning perspective, challenges arise when the agent is prevented from leveraging large fully-annotated datasets, and interactions with supervisory signals are instead sparsely distributed over space and time. This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream. The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations. Spatio-temporal stochastic coherence along the attention trajectory, paired with a contrastive term, yields an unsupervised learning criterion that naturally copes with the considered setting. Unlike most existing works, the learned representations are used in open-set class-incremental classification of each frame pixel, relying on few supervisions. Our experiments leverage 3D virtual environments and show that the proposed agents can learn to distinguish objects just by observing the video stream. Inheriting features from state-of-the-art models is not as powerful as one might expect.
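As a minimal, hypothetical sketch (not the authors' actual formulation), a coherence-plus-contrastive criterion of this kind can be illustrated as follows: the feature at the attended pixel should stay similar across consecutive frames along the attention trajectory (coherence), while features at other, unattended locations are pushed away (contrast). All names, shapes, and the hinge-style contrastive term below are illustrative assumptions.

```python
import numpy as np

def coherence_contrastive_loss(f_prev, f_curr, attend_prev, attend_curr,
                               neg_coords, margin=1.0):
    """Hypothetical sketch of a spatio-temporal coherence + contrastive loss.

    f_prev, f_curr : (H, W, D) pixel-wise feature maps of two consecutive frames.
    attend_prev, attend_curr : (row, col) attended location in each frame,
        i.e. two successive points on the attention trajectory.
    neg_coords : list of (row, col) unattended locations used as negatives.
    """
    a = f_prev[attend_prev]  # feature at attended pixel, frame t-1
    b = f_curr[attend_curr]  # feature at attended pixel, frame t

    # Coherence term: representations along the trajectory should match.
    coherence = float(np.sum((a - b) ** 2))

    # Contrastive term: hinge that penalizes negatives closer than `margin`.
    contrast = 0.0
    for (r, c) in neg_coords:
        d = float(np.sqrt(np.sum((b - f_curr[r, c]) ** 2)))
        contrast += max(0.0, margin - d)

    return coherence + contrast / max(len(neg_coords), 1)
```

In a training loop, such a loss would be minimized over the stream frame by frame, with the attention mechanism supplying `attend_prev`/`attend_curr` and negatives sampled away from the trajectory.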