An accurate model of natural speech directivity is an important step toward achieving realistic vocal presence in a virtual communication setting. In this article, we propose a method to estimate and reconstruct the spatial energy distribution pattern of natural, unconstrained speech. We detail our method in two stages. First, using recordings of speech captured by a real, static microphone array, we create a virtual array that tracks the movement of the speaker over time. We use this egocentric virtual array to measure and encode the high-resolution directivity pattern of the speech signal as it evolves dynamically with natural speech and movement. Second, using this encoded directivity representation, we train a machine learning model to estimate the full, dynamic directivity pattern from a limited set of speech signals, as would be available when speech is recorded by the microphones on a head-mounted display (HMD). We examine a variety of model architectures and training paradigms, and discuss the utility and practicality of each implementation. Our results demonstrate that neural networks can regress from limited speech information to an accurate, dynamic estimate of the full directivity pattern.