3D reconstruction of transparent and concave structured objects, with inferred material properties, remains an open research problem for robot navigation in unstructured environments. In this paper, we propose a multimodal single- and multi-frame neural network for 3D reconstruction from audio-visual inputs. Our trained reconstruction LSTM autoencoder, 3D-MOV, accepts multiple inputs to account for a variety of surface types and views. Our neural network produces high-quality 3D reconstructions in a voxel representation. Using Intersection-over-Union (IoU), we evaluate against baseline methods on the synthetic audio-visual datasets ShapeNet and Sound20K, which provide impact sounds and bounding box annotations. To the best of our knowledge, our single- and multi-frame model is the first audio-visual reconstruction neural network for 3D geometry and material representation.
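For reference, the IoU metric over voxel grids mentioned above can be computed as in the following minimal NumPy sketch. This is an illustrative implementation assuming binary occupancy grids and a 0.5 binarization threshold, not the paper's evaluation code; the grid size and threshold are assumptions.

```python
import numpy as np

def voxel_iou(pred: np.ndarray, target: np.ndarray, threshold: float = 0.5) -> float:
    """Intersection-over-Union between a predicted occupancy grid and a
    ground-truth voxel grid, both of shape (D, H, W)."""
    # Binarize the predicted occupancy probabilities.
    pred_occ = pred >= threshold
    target_occ = target.astype(bool)

    intersection = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    # Two empty grids are treated here as a perfect match.
    return 1.0 if union == 0 else float(intersection) / float(union)

# Usage example with random 32x32x32 grids (hypothetical resolution).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.random((32, 32, 32))          # predicted occupancy probabilities
    target = rng.random((32, 32, 32)) > 0.5  # binary ground-truth voxels
    print(f"IoU: {voxel_iou(pred, target):.3f}")
```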