3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated ``detect-then-describe'' pipeline equipped with numerous hand-crafted components. However, these hand-crafted components yield suboptimal performance given the cluttered spatial and class distributions of objects across different scenes. In this paper, we propose a simple yet effective transformer framework, Vote2Cap-DETR, based on the recently popular \textbf{DE}tection \textbf{TR}ansformer (DETR). Compared with prior work, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method adopts a full transformer encoder-decoder architecture with a learnable, vote-query-driven object decoder and a caption decoder that produces dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method performs detection and captioning in a single stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR surpasses the current state of the art by 11.13\% and 7.11\% in CIDEr@0.5IoU, respectively. Code will be released soon.
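For illustration only, the following is a minimal, hypothetical PyTorch-style sketch of such a one-stage ``detect and describe'' pipeline. All module names, dimensions, and the simplified per-query caption head are our assumptions and do not reflect the authors' implementation, which builds vote queries from point-cloud features with predicted spatial offsets and uses an autoregressive caption decoder.
\begin{verbatim}
import torch
import torch.nn as nn

class OneStageDenseCaptioner(nn.Module):
    # Hypothetical sketch of a one-stage "detect and describe" model in the
    # spirit of Vote2Cap-DETR; NOT the authors' implementation.
    def __init__(self, d_model=256, num_queries=256, num_classes=18, vocab=3000):
        super().__init__()
        # Scene encoder (the real model encodes a point cloud with a
        # PointNet++-style backbone followed by a transformer encoder).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        # Learnable queries; the actual "vote queries" are built from encoder
        # features plus predicted offsets toward object centers.
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # Parallel heads: one box/class prediction and one caption per query,
        # so detection and captioning happen in a single forward pass.
        self.box_head = nn.Linear(d_model, 6)            # box center + size
        self.cls_head = nn.Linear(d_model, num_classes)  # object class logits
        self.cap_head = nn.Linear(d_model, vocab)        # simplified caption logits
        # (the real caption decoder is autoregressive, not a single linear layer)

    def forward(self, scene_tokens):
        # scene_tokens: (B, N, d_model) tokenized scene features
        memory = self.encoder(scene_tokens)
        q = self.query_embed.weight.unsqueeze(0).expand(scene_tokens.size(0), -1, -1)
        hs = self.decoder(q, memory)                     # (B, num_queries, d_model)
        return self.box_head(hs), self.cls_head(hs), self.cap_head(hs)
\end{verbatim}
As in DETR, bipartite matching between per-query predictions and ground-truth objects would supervise such outputs in a set-prediction manner, which is what allows detection and captioning to be trained jointly without a separate detection stage.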