With significant annotation savings, point supervision has proven effective for numerous 2D and 3D scene understanding problems. This success is primarily attributed to the structured output space; i.e., samples with high spatial affinity tend to share the same labels. In the same spirit, we study affordance segmentation with point supervision, a setting that inherits an unexplored dual affinity: spatial affinity and label affinity. By label affinity, we refer to the fact that affordance segmentation is a multi-label prediction problem: a plate can be both holdable and containable. By spatial affinity, we refer to a universal prior that nearby pixels with similar visual features should share the same point annotation. To tackle label affinity, we devise a dense prediction network that enhances label relations by effectively densifying labels in a new domain (i.e., label co-occurrence). To address spatial affinity, we exploit a Transformer backbone for global patch interaction and a regularization loss. In experiments, we benchmark our method on the challenging CAD120 dataset, showing significant performance gains over prior methods.
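To make the two affinities concrete, the following is a minimal NumPy sketch, not the method proposed in this work. It assumes per-pixel sigmoid outputs so that one pixel can carry several affordance labels, a binary cross-entropy evaluated only at the annotated points, and a simple pairwise smoothness term between 4-neighbours as a stand-in for the regularization loss, whose exact form the abstract does not specify. The function names `point_bce_loss` and `affinity_loss`, the `sigma` parameter, and the array layouts are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def point_bce_loss(logits, points):
    """Multi-label BCE evaluated only at annotated pixels (illustrative).

    logits: array of shape (C, H, W), one score map per affordance class.
    points: list of (row, col, labels), labels being a length-C 0/1 vector,
            so a single point may be, e.g., both holdable and containable.
    """
    eps = 1e-7
    total = 0.0
    for r, c, labels in points:
        p = np.clip(sigmoid(logits[:, r, c]), eps, 1 - eps)
        y = np.asarray(labels, dtype=float)
        total += -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    return total / max(len(points), 1)

def affinity_loss(logits, features, sigma=1.0):
    """Assumed spatial-affinity regularizer (illustrative).

    Penalizes prediction differences between 4-neighbouring pixels,
    weighted by how similar their visual features are.
    features: array of shape (D, H, W).
    """
    probs = sigmoid(logits)
    loss, count = 0.0, 0
    for axis in (1, 2):  # vertical, then horizontal neighbours
        dp = np.diff(probs, axis=axis)
        df = np.diff(features, axis=axis)
        w = np.exp(-(df ** 2).sum(axis=0) / (2 * sigma ** 2))
        loss += (w * (dp ** 2).sum(axis=0)).sum()
        count += w.size
    return loss / count
```

In such a sketch the two terms would typically be summed with a weighting factor; the point loss supplies the sparse supervision, while the affinity term spreads it to unlabeled pixels that look alike.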