The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D data. In this paper, we curate a large-scale and complementary dataset extending both of the aforementioned ones by associating all objects mentioned in a referential sentence with their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, improving the SoTA on both the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we experiment with competitive baselines and recent methods for the task of language generation and show that, as with neural listeners, 3D neural speakers can also noticeably benefit from training with ScanEnts3D, improving the SoTA by 13.2 CIDEr points on the Nr3D benchmark. Overall, our carefully conducted experimental studies strongly support the conclusion that, by learning on ScanEnts3D, commonly used visio-linguistic 3D architectures can become more efficient and interpretable in their generalization without needing the newly collected annotations at test time. The project's webpage is https://scanents3d.github.io/.