Collecting annotated data for semantic segmentation is time-consuming and hard to scale up. In this paper, we propose, for the first time, a unified framework, termed Multi-Dataset Pretraining (MDP), to take full advantage of the fragmented annotations of different datasets. The highlight is that annotations from different domains can be efficiently reused and consistently boost performance for each specific domain. This is achieved by first pretraining the network with the proposed pixel-to-prototype contrastive loss over multiple datasets, regardless of their taxonomies, and then fine-tuning the pretrained model on each specific dataset as usual. To better model the relationships among images and classes from different datasets, we extend the pixel-level embeddings via cross-dataset mixing and propose a pixel-to-class sparse coding strategy that explicitly models pixel-to-class similarity over the manifold embedding space. In this way, we increase intra-class compactness and inter-class separability, while also accounting for inter-class similarity across different datasets for better transferability. Experiments conducted on several benchmarks demonstrate its superior performance. Notably, MDP consistently outperforms ImageNet-pretrained models by a considerable margin, while using less than 10% of the samples for pretraining.
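To make the pixel-to-prototype objective concrete, below is a minimal, hypothetical sketch of such a contrastive loss in PyTorch. It assumes prototypes are the per-class mean embeddings computed within a batch and an InfoNCE-style formulation with a temperature; the paper's exact prototype construction, memory/update rule, and hyperparameters are not given in this abstract, so all names and defaults here are illustrative assumptions.

```python
# Hypothetical sketch: pixel-to-prototype contrastive loss (InfoNCE-style).
# Assumption: prototypes = normalized per-class mean embeddings within the batch;
# the actual MDP formulation may differ (e.g., momentum-updated prototype banks).
import torch
import torch.nn.functional as F


def pixel_to_prototype_loss(embeddings: torch.Tensor,
                            labels: torch.Tensor,
                            temperature: float = 0.1,
                            ignore_index: int = 255) -> torch.Tensor:
    """embeddings: (N, D) pixel embeddings; labels: (N,) class ids pooled across datasets."""
    valid = labels != ignore_index
    embeddings = F.normalize(embeddings[valid], dim=1)
    labels = labels[valid]

    # Build prototypes as the normalized mean embedding of each class present in the batch.
    prototypes, proto_labels = [], []
    for c in labels.unique():
        prototypes.append(embeddings[labels == c].mean(dim=0))
        proto_labels.append(c)
    prototypes = F.normalize(torch.stack(prototypes), dim=1)   # (C_present, D)
    proto_labels = torch.stack(proto_labels)                    # (C_present,)

    # Pixel-to-prototype similarities, contrasted over all prototypes in the batch:
    # each pixel is pulled toward its own class prototype and pushed from the others.
    logits = embeddings @ prototypes.t() / temperature          # (N_valid, C_present)
    targets = (labels.unsqueeze(1) == proto_labels.unsqueeze(0)).float().argmax(dim=1)
    return F.cross_entropy(logits, targets)
```

In this sketch the loss treats every other class prototype in the batch as a negative, which is what would increase intra-class compactness and inter-class separability as described above; cross-dataset mixing would simply enlarge the pool of classes (and thus prototypes) that a pixel is contrasted against.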