We propose an approach to semantic segmentation that matches state-of-the-art supervised performance when applied in a zero-shot setting. That is, it achieves results equivalent to those of supervised methods on each of the major semantic segmentation datasets, without training on those datasets. This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class. The generality and simplicity of this approach enable merging multiple datasets from different domains, each with varying class labels and semantics. The resulting merged semantic segmentation dataset of over 2 million images enables training a model that matches the performance of state-of-the-art supervised methods on 7 benchmark datasets, despite using none of their images. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over the state-of-the-art supervised results on NYUD-v2 and PASCAL-Context, reaching 60% and 65% mIoU, respectively. Because language embeddings place related concepts close together, our method can even segment classes with unseen labels. Extensive experiments demonstrate strong generalization to unseen image domains and unseen labels, and show that the method enables substantial performance improvements in downstream applications, including depth estimation and instance segmentation.
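To make the core mechanism concrete, the following is a minimal sketch (not the paper's implementation) of zero-shot segmentation via label embeddings: each class description is encoded into a vector, the model predicts a per-pixel embedding, and every pixel is assigned the class whose description embedding is closest. Here `text_encoder` and `pixel_encoder` are hypothetical stand-ins for the model's text and image branches.

```python
import torch
import torch.nn.functional as F

def embed_labels(descriptions, text_encoder):
    """Encode one short paragraph per class into a unit-norm embedding.

    `text_encoder` is a hypothetical callable mapping a list of strings
    to a (num_classes, dim) tensor, e.g. a frozen CLIP-style text tower.
    """
    with torch.no_grad():
        emb = text_encoder(descriptions)          # (num_classes, dim)
    return F.normalize(emb, dim=-1)

def segment(image, label_embeddings, pixel_encoder):
    """Assign each pixel the class whose description embedding is closest.

    `pixel_encoder` is a hypothetical dense encoder mapping an image to
    a (dim, H, W) tensor of per-pixel embeddings.
    """
    feats = pixel_encoder(image)                  # (dim, H, W)
    feats = F.normalize(feats, dim=0)
    # Cosine similarity between every pixel embedding and every class embedding.
    logits = torch.einsum("dhw,cd->chw", feats, label_embeddings)
    return logits.argmax(dim=0)                   # (H, W) predicted class indices
```

Because classification reduces to nearest-neighbor search in the shared embedding space, swapping in descriptions of unseen classes at inference time requires no retraining, which is what makes both dataset merging and unseen-label segmentation possible under this formulation.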