Models for semantic segmentation require a large amount of hand-labeled training data, which is costly and time-consuming to produce. To address this, we present a label fusion framework that improves semantic pixel labels of video sequences in an unsupervised manner. We make use of a 3D mesh representation of the environment and fuse the predictions of different frames into a consistent representation using semantic mesh textures. Rendering the semantic mesh with the original intrinsic and extrinsic camera parameters yields a set of improved semantic segmentation images. Due to our optimized CUDA implementation, we are able to exploit the entire $c$-dimensional probability distribution of annotations over $c$ classes in an uncertainty-aware manner. We evaluate our method on the ScanNet dataset, where we improve annotations produced by the state-of-the-art segmentation network ESANet from $52.05\%$ to $58.25\%$ pixel accuracy. We publish the source code of our framework online to foster future research in this area (\url{https://github.com/fferflo/semantic-meshes}). To the best of our knowledge, this is the first publicly available label fusion framework for semantic image segmentation based on meshes with semantic textures.
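To illustrate the idea of uncertainty-aware fusion, the sketch below aggregates full $c$-dimensional per-pixel class distributions from multiple views into per-texel distributions of a mesh texture. The function name, the pixel-to-texel mapping inputs, and the product-of-distributions fusion rule are illustrative assumptions for exposition; they are not the API or the exact fusion rule of the published framework, which uses an optimized CUDA implementation.

\begin{verbatim}
import numpy as np

def fuse_texel_probabilities(pixel_probs_per_view,
                             pixel_to_texel_per_view,
                             num_texels):
    """Hypothetical sketch of uncertainty-aware label fusion.

    Each view contributes a full c-dimensional class distribution per
    pixel; distributions that project onto the same mesh texel are
    combined by accumulating log-probabilities (i.e. a product of
    distributions), rather than fusing hard argmax labels.

    pixel_probs_per_view:    list of (H, W, c) softmax outputs
    pixel_to_texel_per_view: list of (H, W) texel indices per pixel
                             (-1 where no mesh surface is visible)
    """
    c = pixel_probs_per_view[0].shape[-1]
    log_acc = np.zeros((num_texels, c))

    for probs, texel_ids in zip(pixel_probs_per_view,
                                pixel_to_texel_per_view):
        valid = texel_ids >= 0
        # Accumulate log-probabilities per texel across all views.
        np.add.at(log_acc, texel_ids[valid],
                  np.log(probs[valid] + 1e-12))

    # Normalize back to a per-texel class distribution.
    fused = np.exp(log_acc - log_acc.max(axis=1, keepdims=True))
    fused /= fused.sum(axis=1, keepdims=True)
    return fused  # (num_texels, c)
\end{verbatim}

Rendering the fused per-texel distributions (or their argmax labels) back into each frame with the original camera parameters would then yield the refined segmentation images described above.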