Advances in speech technology have brought convenience to our lives. However, concern is on the rise because speech signals contain multiple personal attributes, which can lead either to sensitive information leakage or to biased decisions. In this work, we propose an attribute-aligned learning strategy to derive a speech representation that can flexibly address these issues through an attribute-selection mechanism. Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes the speech representation into attribute-sensitive nodes, to derive an identity-free representation for speech emotion recognition (SER) and an emotionless representation for speaker verification (SV). Compared to the current state-of-the-art method based on adversarial learning, evaluated on a large emotion corpus, the MSP-Podcast, our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV. Moreover, our proposed learning strategy reduces the number of models and the training effort needed to serve multiple privacy-preserving tasks.
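The core idea of the attribute-selection mechanism can be illustrated with a minimal sketch: a latent vector is partitioned into attribute-sensitive groups of nodes, and the group tied to the attribute being suppressed is masked out before the downstream task. All dimensions, layer sizes, and function names below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a 64-dim latent space split
# into 32 emotion-sensitive nodes followed by 32 speaker-sensitive nodes.
LATENT_DIM, EMO_DIM = 64, 32

def encode(x, W_mu, W_logvar):
    """Toy VAE encoder: linear maps to mean and log-variance,
    then a sample via the reparameterization trick."""
    mu = x @ W_mu
    logvar = x @ W_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def select_attributes(z, keep="emotion"):
    """Attribute selection: zero out the latent nodes tied to the
    attribute we want to suppress, keeping the rest intact."""
    z = z.copy()
    if keep == "emotion":      # identity-free representation for SER
        z[..., EMO_DIM:] = 0.0
    elif keep == "speaker":    # emotionless representation for SV
        z[..., :EMO_DIM] = 0.0
    return z

# Example: one 128-dim acoustic feature vector through random weights.
x = rng.standard_normal(128)
W_mu = rng.standard_normal((128, LATENT_DIM)) * 0.1
W_logvar = rng.standard_normal((128, LATENT_DIM)) * 0.1

z = encode(x, W_mu, W_logvar)
z_ser = select_attributes(z, keep="emotion")  # speaker nodes zeroed
z_sv = select_attributes(z, keep="speaker")   # emotion nodes zeroed
```

In this sketch a single encoder serves both tasks, which mirrors the abstract's claim that one model can replace separate adversarially trained models per task; the actual LR-VAE training objective (reconstruction plus KL and attribute-alignment terms) is not shown here.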