Zero-shot learning on 3D point cloud data is a relatively underexplored problem compared to its 2D image counterpart. 3D data brings new challenges for ZSL due to the unavailability of robust pre-trained feature extraction models. To address this problem, we propose a prompt-guided 3D scene generation and supervision method that augments 3D data to train the network better, exploring the complex interplay of seen and unseen objects. First, we merge the point clouds of two 3D models in certain ways described by a prompt. The prompt acts as an annotation describing each 3D scene. Later, we perform contrastive learning to train our proposed architecture in an end-to-end manner. We argue that 3D scenes can relate objects more efficiently than single objects because popular language models (like BERT) achieve higher performance when objects appear in a context. Our proposed prompt-guided scene generation method encapsulates data augmentation and prompt-based annotation/captioning to improve 3D ZSL performance. We achieve state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
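The scene generation step described above (merging two point clouds and captioning the result with a prompt) could be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the side-by-side placement, and the prompt template are all assumptions for the sake of example.

```python
import numpy as np

def generate_prompt_scene(pc_a, pc_b, label_a, label_b, offset=2.0):
    """Hypothetical sketch of prompt-guided scene generation:
    merge two point clouds into one scene and produce a prompt
    that describes their arrangement."""
    # Center each object at the origin.
    a = pc_a - pc_a.mean(axis=0)
    b = pc_b - pc_b.mean(axis=0)
    # Place the second object beside the first along the x-axis
    # (one of the "certain ways" a prompt might describe).
    b = b + np.array([offset, 0.0, 0.0])
    scene = np.concatenate([a, b], axis=0)
    # The prompt acts as the annotation for the generated scene.
    prompt = f"a {label_a} beside a {label_b}"
    return scene, prompt

scene, prompt = generate_prompt_scene(
    np.random.rand(1024, 3), np.random.rand(1024, 3), "chair", "table"
)
```

The scene/prompt pairs produced this way would then feed the contrastive training loop, with the prompt embedded by a language model such as BERT.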