Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require "reference" (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets, and each emulated the output of an established full-reference speech quality or intelligibility estimation algorithm. We have updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values. We create a second network that additionally tracks four subjective speech quality dimensions. We offer a third network that focuses on just subjective quality scores and achieves very high levels of agreement. This work has leveraged 334 hours of speech in 13 languages, over two million full-reference target values, and over 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to their operation using the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space, and this vector is then mapped to a quality or intelligibility value for the input waveform.
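
The signal-processing interpretation in the abstract can be illustrated with a small numerical sketch. It is not the WAWEnet architecture itself. It only shows that a ReLU applied to a zero-mean tone yields a signal with a nonzero DC component, and that the per-channel DC values of 96 signals form a 96-D vector that can be mapped to a single number. The test tone, the random stand-in for the 96 output signals, and the linear mapping (`w`, `b`) are all assumptions made for illustration; the abstract does not say what form the final mapping takes.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: keep positive samples, zero out the rest."""
    return np.maximum(x, 0.0)

def dc_component(x):
    """DC (zero-frequency) component of a signal, i.e. its mean value."""
    return float(np.mean(x))

# Hypothetical test signal: a 1 kHz tone sampled at 16 kHz for one second.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)

dc_before = dc_component(x)        # essentially zero: the tone has no DC
dc_after = dc_component(relu(x))   # positive: rectification creates DC

# Stand-in for the 96 output signals (random data, not real network output).
rng = np.random.default_rng(0)
signals = relu(rng.standard_normal((96, 1000)))

# The DC value of each signal gives one coordinate of a 96-D latent vector.
latent = signals.mean(axis=1)

# Assumed mapping from latent vector to a score: a plain linear map
# with hypothetical weights.
w = rng.standard_normal(96)
b = 0.0
score = float(latent @ w + b)

print(f"DC before ReLU: {dc_before:.6f}")
print(f"DC after ReLU:  {dc_after:.6f}")
print(f"latent shape:   {latent.shape}")
```

The tone has no energy at 0 Hz before the ReLU, yet its rectified version has a clearly positive mean. This is the sense in which a ReLU moves information from non-DC components into the DC component.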
