Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at downstream tasks such as retrieval and classification. In contrast, work on joint representation learning for 3D shapes and text has thus far focused mostly on improving embeddings through modeling of complex attention between representations, or through multi-task learning. We show that with large-batch contrastive learning we achieve SoTA on text-shape retrieval without complex attention mechanisms or losses. Prior work on 3D and text representations has also focused on bimodal representation learning, using either voxels or multi-view images paired with text. Going beyond bimodal learning, we propose a trimodal learning scheme that achieves even higher performance and better representations for all modalities.
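To make the objective concrete, below is a minimal sketch of a trimodal contrastive loss: a symmetric InfoNCE-style term applied to each pair of modality embeddings (text, voxel, multi-view image). The function names, encoder outputs, and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    z_a, z_b: (batch_size, dim) embeddings where row i of z_a and row i
    of z_b form a positive pair; all other rows in the batch act as
    in-batch negatives.
    """
    logits = z_a @ z_b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives lie on the diagonal
    # Cross-entropy in both directions (a -> b and b -> a), then average.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(z_text, z_voxel, z_image, temperature=0.07):
    """Sum of pairwise contrastive losses over the three modality pairs."""
    z_text = F.normalize(z_text, dim=-1)
    z_voxel = F.normalize(z_voxel, dim=-1)
    z_image = F.normalize(z_image, dim=-1)
    return (contrastive_loss(z_text, z_voxel, temperature) +
            contrastive_loss(z_text, z_image, temperature) +
            contrastive_loss(z_voxel, z_image, temperature))
```

With large batches, each example is contrasted against many in-batch negatives, which is what lets this simple objective reach strong retrieval performance without attention across modalities or auxiliary losses.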