A long-standing goal in the field of sensory substitution is to enable sound perception for deaf and hard of hearing (DHH) people by visualizing audio content. Unlike existing models that translate into sign language, between speech and text, or between text and images, we target immediate, low-level audio-to-video translation that applies to generic environmental sounds as well as human speech. Since such a substitution is artificial and lacks labels for supervised learning, our core contribution is to build a mapping from audio to video that learns from unpaired examples via high-level constraints. For speech, we additionally disentangle content from style, such as gender and dialect. Qualitative and quantitative results, including a human study, demonstrate that our unpaired translation approach preserves important audio features in the generated video and that videos of faces and numbers are well suited for visualizing high-dimensional audio features that humans can parse to match and distinguish between sounds and words. Code and models are available at https://chunjinsong.github.io/audioviewer.
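To make the idea of learning an audio-to-video mapping from unpaired examples concrete, the sketch below shows one generic way such a system could be set up: separate audio and video autoencoders whose latent spaces are linked by learned cross-modal mappings and tied together with reconstruction and cycle losses, so no aligned audio/video pairs are needed. This is a minimal illustrative sketch only; all module names, dimensions, and losses are assumptions and do not describe the paper's actual architecture or constraints.

```python
import torch
import torch.nn as nn

# Hypothetical two-layer encoder/decoder used for every module below.
class MLP(nn.Module):
    def __init__(self, d_in, d_out, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

# Audio and video autoencoders with latent codes of the same size,
# plus mappings between the two latent spaces (dimensions are made up).
d_audio, d_video, d_latent = 80, 64 * 64, 32
enc_a, dec_a = MLP(d_audio, d_latent), MLP(d_latent, d_audio)
enc_v, dec_v = MLP(d_video, d_latent), MLP(d_latent, d_video)
a2v = MLP(d_latent, d_latent)   # audio latent -> video latent
v2a = MLP(d_latent, d_latent)   # video latent -> audio latent

params = [p for m in (enc_a, dec_a, enc_v, dec_v, a2v, v2a)
          for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)

# Unpaired batches: the audio frames and video frames are unrelated.
audio_batch = torch.randn(16, d_audio)   # e.g. mel-spectrogram frames
video_batch = torch.randn(16, d_video)   # e.g. flattened face images

for step in range(100):
    za, zv = enc_a(audio_batch), enc_v(video_batch)
    # Per-modality reconstruction keeps each latent code informative.
    loss_rec = ((dec_a(za) - audio_batch) ** 2).mean() + \
               ((dec_v(zv) - video_batch) ** 2).mean()
    # Cycle constraint: mapping a latent to the other modality's space
    # and back should recover it; this stands in for the high-level
    # constraints that make unpaired training possible.
    loss_cyc = ((v2a(a2v(za)) - za) ** 2).mean() + \
               ((a2v(v2a(zv)) - zv) ** 2).mean()
    loss = loss_rec + loss_cyc
    opt.zero_grad()
    loss.backward()
    opt.step()

# At test time a sound would be visualized as dec_v(a2v(enc_a(audio))).
```

Under this kind of setup, additional terms (e.g. perceptual or structural constraints on the generated frames) would play the role of the high-level constraints mentioned above; the cycle loss here is only one possible choice.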