We propose a framework to continuously learn object-centric representations for visual learning and understanding. Existing object-centric representations either rely on supervision that individualizes objects in the scene, or perform unsupervised disentanglement that struggles with complex real-world scenes. To mitigate the annotation burden and relax the constraints on the statistical complexity of the data, our method leverages interactions to effectively sample diverse variations of an object and the corresponding training signals while learning the object-centric representations. Throughout learning, objects are streamed one by one in random order with unknown identities, and each is associated with a latent code from which a convolutional hypernetwork synthesizes discriminative weights for that object. Moreover, re-identification of previously learned objects and forgetting prevention are employed to make the learning process efficient and robust. We perform an extensive study of the key features of the proposed framework and analyze the characteristics of the learned representations. Furthermore, we demonstrate that the learned representations improve label efficiency in downstream tasks. Our code and trained models will be made publicly available.
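To make the hypernetwork mechanism concrete, below is a minimal, hypothetical sketch (not the authors' released code): a per-object latent code is mapped by a small network to convolutional filter weights, which are then applied to shared backbone features to score that object. All module names, dimensions, and the embedding-table bookkeeping are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvHyperNetwork(nn.Module):
    """Illustrative hypernetwork: latent code -> object-specific conv weights."""
    def __init__(self, code_dim=64, in_channels=256, out_channels=1, kernel_size=3):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        # Maps an object's latent code to a flat vector of conv filter weights.
        self.weight_gen = nn.Sequential(
            nn.Linear(code_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, out_channels * in_channels * kernel_size * kernel_size),
        )

    def forward(self, code, features):
        # code:     (code_dim,)              latent code of one object
        # features: (B, in_channels, H, W)   shared backbone feature map
        w = self.weight_gen(code).view(
            self.out_channels, self.in_channels, self.kernel_size, self.kernel_size
        )
        # Convolve the shared features with the synthesized object-specific weights.
        return F.conv2d(features, w, padding=self.kernel_size // 2)

# Usage sketch: one latent code per streamed object; stored codes could later be
# reused to re-identify objects that were already learned (assumed bookkeeping).
codes = nn.Embedding(num_embeddings=100, embedding_dim=64)  # capacity for 100 objects
hyper = ConvHyperNetwork()
feat = torch.randn(2, 256, 32, 32)                          # dummy backbone features
score_map = hyper(codes.weight[0], feat)                     # (2, 1, 32, 32)
```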