We introduce a new architecture for unsupervised object-centric representation learning and multi-object detection and segmentation. It uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and to associate a feature vector with each object. A transformer encoder handles occlusions and redundant detections, while a convolutional autoencoder reconstructs the background. We show that this architecture significantly outperforms the state of the art on complex synthetic benchmarks.
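To make the translation-equivariance property concrete: if object scores are computed by an operation that commutes with translation (cross-correlation is the simplest case), then shifting the input shifts the predicted coordinates by the same amount. The toy sketch below illustrates only this property in 1D with plain Python; it is not the paper's attention mechanism, and all names here are illustrative.

```python
# Toy illustration of translation equivariance for localization:
# scores come from cross-correlation, so shifting the input shifts
# the argmax (the predicted coordinate) by the same amount.
# This is NOT the paper's mechanism, just the underlying property.

def cross_correlate(signal, template):
    """Valid-mode cross-correlation: one score per template offset."""
    n, k = len(signal), len(template)
    return [sum(signal[i + j] * template[j] for j in range(k))
            for i in range(n - k + 1)]

def predict_coordinate(signal, template):
    """Predicted object position = argmax of the equivariant score map."""
    scores = cross_correlate(signal, template)
    return max(range(len(scores)), key=scores.__getitem__)

template = [1.0, 2.0, 1.0]   # a toy "object" pattern
scene = [0.0] * 10
scene[4:7] = template        # object placed at position 4

shifted = [0.0] * 10
shifted[6:9] = template      # same object, shifted right by 2

p0 = predict_coordinate(scene, template)
p1 = predict_coordinate(shifted, template)
print(p0, p1)  # the predicted coordinate shifts with the object
```

Because the score map is equivariant, no position-specific parameters are needed to localize an object anywhere in the scene, which is the motivation for using such a mechanism for coordinate prediction.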