Strong evidence suggests that humans perceive the 3D world by parsing visual scenes and objects into part-whole hierarchies. Although deep neural networks can learn powerful multi-level representations, they cannot explicitly model part-whole hierarchies, which limits their expressiveness and interpretability when processing 3D vision data such as point clouds. To this end, we propose an encoder-decoder style latent variable model that explicitly learns part-whole hierarchies for multi-level point cloud segmentation. Specifically, the encoder takes a point cloud as input and predicts the per-point latent subpart distribution at the middle level. The decoder takes the latent variable and the encoder features as input and predicts the per-point part distribution at the top level. During training, only annotated part labels at the top level are provided, which makes the whole framework weakly supervised. We explore two approximate inference algorithms, i.e., most-probable-latent and Monte Carlo methods, and three stochastic gradient estimators for learning discrete latent variables, i.e., straight-through, REINFORCE, and pathwise estimators. Experimental results on the PartNet dataset show that the proposed method achieves state-of-the-art performance in both top-level part segmentation and middle-level latent subpart segmentation.
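To make the two inference strategies concrete, the following is a minimal sketch for a single point, not the authors' implementation. It assumes the encoder output is a categorical distribution `q_z` over latent subparts and, as a simplification, that the decoder reduces to a lookup table `p_y_given_z` of part probabilities per subpart (the decoder described above also conditions on encoder features). All function names and toy numbers are hypothetical.

```python
import random


def most_probable_latent(q_z, p_y_given_z):
    """Plug in the single most likely subpart z* = argmax_z q(z | x)."""
    z_star = max(range(len(q_z)), key=lambda z: q_z[z])
    return list(p_y_given_z[z_star])


def monte_carlo(q_z, p_y_given_z, num_samples=1000, rng=None):
    """Average p(y | z, x) over samples z ~ q(z | x)."""
    rng = rng or random.Random(0)
    num_parts = len(p_y_given_z[0])
    totals = [0.0] * num_parts
    for z in rng.choices(range(len(q_z)), weights=q_z, k=num_samples):
        for y in range(num_parts):
            totals[y] += p_y_given_z[z][y]
    return [t / num_samples for t in totals]


def exact_marginal(q_z, p_y_given_z):
    """Exact sum over z; only tractable here because the toy latent space is tiny."""
    num_parts = len(p_y_given_z[0])
    return [
        sum(q_z[z] * p_y_given_z[z][y] for z in range(len(q_z)))
        for y in range(num_parts)
    ]


# Toy point with 3 latent subparts and 2 top-level parts (made-up numbers).
q_z = [0.5, 0.3, 0.2]
p_y_given_z = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

mpl = most_probable_latent(q_z, p_y_given_z)
mc = monte_carlo(q_z, p_y_given_z)
exact = exact_marginal(q_z, p_y_given_z)  # roughly (0.59, 0.41)
```

In this toy case the most-probable-latent prediction commits to one subpart and is more confident than the true marginal, while the Monte Carlo estimate approaches the exact marginal as the number of samples grows.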