Most existing human matting algorithms try to separate a pure human-only foreground from the background. In this paper, we propose a Virtual Multi-modality Foreground Matting (VMFM) method to learn a human-object interactive foreground (a human together with the objects he or she interacts with) from a raw RGB image. The VMFM method requires no additional inputs, e.g., a trimap or a known background. We reformulate foreground matting as a self-supervised multi-modality problem: each input image is factored into an estimated depth map, a segmentation mask, and an interaction heatmap using three auto-encoders. To fully exploit the characteristics of each modality, we first train a dual encoder-to-decoder network in which both branches estimate the same alpha matte. We then introduce a self-supervised method, Complementary Learning (CL), to predict a deviation probability map and exchange reliable gradients across modalities without labels. We conducted extensive experiments to analyze the effectiveness of each modality and the significance of the different components of complementary learning. We demonstrate that our model outperforms state-of-the-art methods.
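To make the dual encoder-to-decoder structure and the cross-modality exchange more concrete, the following is a minimal PyTorch-style sketch. The module names (ModalityEncoder, MatteDecoder, DualBranchMatting), the channel counts, the pairing of modalities per branch, and the confidence-weighted consistency term are illustrative assumptions, not the paper's released implementation.

```python
# Minimal, illustrative sketch of a dual encoder-to-decoder matting network
# with a toy cross-modality consistency term. All names, shapes, and the loss
# form are assumptions for illustration only.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Encodes one virtual modality (e.g., RGB + depth or RGB + segmentation)."""
    def __init__(self, in_ch: int, feat_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class MatteDecoder(nn.Module):
    """Decodes encoder features into a single-channel alpha matte in [0, 1]."""
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, f):
        return self.net(f)


class DualBranchMatting(nn.Module):
    """Two modality branches that predict the same alpha matte from one RGB frame.

    Here branch A takes RGB + depth and branch B takes RGB + segmentation; this
    pairing is a hypothetical choice for the sketch.
    """
    def __init__(self):
        super().__init__()
        self.enc_a = ModalityEncoder(in_ch=4)   # RGB (3) + depth map (1)
        self.enc_b = ModalityEncoder(in_ch=4)   # RGB (3) + segmentation mask (1)
        self.dec_a = MatteDecoder()
        self.dec_b = MatteDecoder()

    def forward(self, rgb, depth, seg):
        alpha_a = self.dec_a(self.enc_a(torch.cat([rgb, depth], dim=1)))
        alpha_b = self.dec_b(self.enc_b(torch.cat([rgb, seg], dim=1)))
        return alpha_a, alpha_b


def complementary_consistency(alpha_a, alpha_b, conf_a, conf_b):
    """Toy stand-in for Complementary Learning: each branch is pulled toward the
    other's prediction, weighted by a hypothetical per-pixel confidence map
    (one minus the predicted deviation probability), so reliable gradients flow
    across modalities without ground-truth alpha labels."""
    loss_a = (conf_b * (alpha_a - alpha_b.detach()).abs()).mean()
    loss_b = (conf_a * (alpha_b - alpha_a.detach()).abs()).mean()
    return loss_a + loss_b
```

In this sketch, detaching the target branch keeps each consistency term from degrading the more reliable prediction; the actual deviation-probability estimation and gradient-exchange rule in the paper may differ.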