We present an approach for building an active agent that learns to segment its visual observations into individual objects by interacting with its environment in a completely self-supervised manner. The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels. The model learned from over 50K interactions generalizes to novel objects and backgrounds. To deal with the noisy training signal for segmenting objects obtained from self-supervised interactions, we propose a robust set loss. A dataset of the robot's interactions, along with a few human-labeled examples, is provided as a benchmark for future research. We test the utility of the learned segmentation model by providing results on a downstream vision-based control task of rearranging multiple objects into target configurations from visual inputs alone. Videos, code, and the robotic interaction dataset are available at https://pathak22.github.io/seg-by-interaction/
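To make the role of the robust set loss concrete, below is a minimal sketch of one way such a loss could be realized: rather than penalizing every pixel-level disagreement with a noisy interaction-derived mask, the loss is relaxed once the predicted mask agrees with the noisy mask at the set level. This is a simplified illustration under our own assumptions, not the paper's exact formulation; the function name `robust_set_loss` and the `iou_tolerance` parameter are hypothetical.

```python
import torch
import torch.nn.functional as F

def robust_set_loss(logits, noisy_mask, iou_tolerance=0.7):
    """Simplified sketch of a set-level loss for noisy interaction masks.

    Instead of forcing the prediction to match the noisy mask pixel-for-pixel,
    per-pixel errors stop being penalized once the predicted mask already
    agrees with the noisy mask at the set level (IoU above `iou_tolerance`).

    logits:     (B, H, W) raw foreground scores from the segmentation model.
    noisy_mask: (B, H, W) binary masks obtained from self-supervised interaction.
    """
    probs = torch.sigmoid(logits)
    pred = (probs > 0.5).float()

    # Set-level agreement: IoU between the hard prediction and the noisy mask.
    inter = (pred * noisy_mask).sum(dim=(1, 2))
    union = ((pred + noisy_mask) > 0).float().sum(dim=(1, 2)).clamp(min=1.0)
    iou = inter / union

    # Per-pixel cross-entropy against the (noisy) mask, averaged per example.
    ce = F.binary_cross_entropy_with_logits(logits, noisy_mask, reduction="none")
    ce = ce.mean(dim=(1, 2))

    # Only examples whose predicted set falls outside the tolerance contribute.
    outside = (iou < iou_tolerance).float()
    return (outside * ce).mean()
```

The design intuition is that interaction-derived masks are reliable about roughly where an object is, but unreliable at exact boundaries, so supervision at the level of the whole pixel set tolerates boundary noise that a plain per-pixel cross-entropy would overfit to.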