Spatial audio, which focuses on immersive 3D sound rendering, is widely applied in the acoustics industry. A key problem of current spatial audio rendering methods is the lack of personalization to the differing anatomies of individuals, which is essential for producing accurate sound source positions. In this work, we address this problem from an interdisciplinary perspective. The rendering of spatial audio is strongly correlated with the 3D shape of the human body, particularly the ears. To this end, we propose to achieve personalized spatial audio by reconstructing 3D human ears from single-view images. First, to benchmark the ear reconstruction task, we introduce AudioEar3D, a high-quality 3D ear dataset consisting of 112 point-cloud ear scans with RGB images. To train a reconstruction model in a self-supervised manner, we further collect a 2D ear dataset named AudioEar2D, composed of 2,000 images, each with manual annotations of occlusion and 55 landmarks. To our knowledge, both datasets are the largest and highest-quality of their kind available for public use. Further, we propose AudioEarM, a reconstruction method guided by a depth estimation network trained on synthetic data, with two loss functions tailored for ear data. Lastly, to bridge the gap between the vision and acoustics communities, we develop a pipeline that integrates the reconstructed ear mesh with an off-the-shelf 3D human body and simulates a personalized Head-Related Transfer Function (HRTF), which is the core of spatial audio rendering. Code and data are publicly available at https://github.com/seanywang0408/AudioEar.
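For readers less familiar with how an HRTF is used at render time, the following is a minimal sketch of the standard binaural rendering step the abstract alludes to: a mono source is convolved with the left- and right-ear head-related impulse responses (HRIRs, the time-domain form of the HRTF) for the source direction. The function name, array shapes, and toy HRIRs below are illustrative assumptions, not part of the AudioEar codebase; a real system would use measured or, as in this work, simulated personalized HRIRs per direction.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono: np.ndarray,
                    hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Render a mono signal to binaural stereo by convolving it with the
    head-related impulse responses of one ear pair (hypothetical helper).

    mono:       (n_samples,) source signal
    hrir_left:  (n_taps,)    left-ear impulse response for the source direction
    hrir_right: (n_taps,)    right-ear impulse response for the source direction
    returns:    (n_samples + n_taps - 1, 2) stereo signal
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Toy usage with random HRIRs, purely to show shapes and the convolution step.
rng = np.random.default_rng(0)
mono = rng.standard_normal(48000)        # 1 s of noise at 48 kHz
hrir_l = rng.standard_normal(256) * 0.01
hrir_r = rng.standard_normal(256) * 0.01
stereo = render_binaural(mono, hrir_l, hrir_r)
print(stereo.shape)  # (48255, 2)
```

Personalization enters through the HRIRs themselves: because they encode how a specific listener's ears and body filter incoming sound, replacing generic HRIRs with ones simulated from the reconstructed ear mesh is what makes the rendered source positions accurate for that listener.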