We focus on weakly supervised semantic segmentation with scribble-level annotations. Regularized losses have proven to be an effective solution for this task. However, most existing regularized losses leverage only static shallow features (color, spatial information) to compute the regularizing kernel, which limits their final performance, since such static shallow features fail to describe pair-wise pixel relationships in complicated cases. In this paper, we propose a new regularized loss that utilizes both shallow and dynamically updated deep features in order to aggregate sufficient information to represent the relationships between different pixels. Moreover, to provide accurate deep features, we adopt a vision transformer as the backbone and design a feature consistency head to train the pair-wise feature relationships. Unlike most approaches, which adopt a multi-stage training strategy with many bells and whistles, our approach can be trained directly in an end-to-end manner, in which the feature consistency head and our regularized loss benefit from each other. Extensive experiments show that our approach achieves new state-of-the-art performance, outperforming other approaches by a significant margin of more than 6\% mIoU.
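To make the idea concrete, the following is a minimal NumPy sketch of a pair-wise regularized loss of the kind described above: a Gaussian affinity kernel built from both shallow cues (color, spatial coordinates) and deep features, multiplied by a relaxed label-disagreement term. All function names, feature shapes, and bandwidth values (`sigma_c`, `sigma_s`, `sigma_d`) are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def regularized_loss(probs, color, coords, deep_feat,
                     sigma_c=0.1, sigma_s=10.0, sigma_d=0.5):
    """Hedged sketch of a pair-wise regularized loss.

    probs:     (N, K) softmax predictions for N pixels.
    color:     (N, 3) static color features.
    coords:    (N, 2) static spatial coordinates.
    deep_feat: (N, D) deep features, dynamically updated during training.
    """
    def gauss(f, sigma):
        # Dense pair-wise squared distances, then a Gaussian affinity.
        d2 = ((f[:, None, :] - f[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    # Kernel combines static shallow cues with deep features, as motivated
    # in the abstract; the product form is an assumption.
    k = gauss(color, sigma_c) * gauss(coords, sigma_s) * gauss(deep_feat, sigma_d)
    np.fill_diagonal(k, 0.0)  # ignore self-affinity

    # Relaxed Potts-style term: similar pixels with differing label
    # distributions are penalized.
    disagreement = 1.0 - probs @ probs.T  # (N, N)
    return float((k * disagreement).sum() / max(k.sum(), 1e-8))
```

Because the kernel depends on `deep_feat`, the affinities change as the backbone is trained, which is the key difference from kernels built only from static color and position.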