Affective Social Anthropomorphic Intelligent System

Human conversational style is characterized by sense of humor, personality, and tone of voice. These characteristics have become essential for conversational intelligent virtual assistants (IVAs). However, most state-of-the-art IVAs fail to interpret the affective semantics of human voices. This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality. A voice style transfer method is also proposed to map the attributes of a specific emotion. First, the temporal audio waveform is converted into frequency-domain data (a Mel-spectrogram), which contains discrete patterns for audio features such as notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder predicts seven different affective states from the voice. In parallel, the voice is fed to DeepSpeech, an RNN model that generates a text transcription from the spectrogram. The transcribed text is then passed to a multi-domain conversational agent that uses Blended Skill Talk, a transformer-based retrieve-and-generate strategy, and beam-search decoding to generate an appropriate textual response. For voice synthesis and style transfer, the system learns an invertible mapping of data to a latent space that can be manipulated, and generates each Mel-spectrogram frame based on the previous Mel-spectrogram frames. Finally, WaveGlow generates the waveform from the spectrogram. Our evaluations of the individual models gave promising results. Furthermore, users who interacted with the system provided positive feedback, which supports the system's effectiveness.
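
The first step of the pipeline, converting a temporal waveform into a Mel-spectrogram, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the frame length (`n_fft`), hop size (`hop`), number of Mel bands (`n_mels`), the Hann window, and the HTK-style Hz-to-Mel formula are all assumptions chosen for illustration, and the function names are hypothetical. A practical system would more likely rely on an optimized audio library.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mels (HTK-style formula; an assumed convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges_hz(n_mels, sample_rate):
    """Return n_mels + 2 frequencies evenly spaced on the mel scale (0 to Nyquist)."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    return [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Build triangular filters; filter m spans edges[m-1]..edges[m+1]."""
    edges = mel_band_edges_hz(n_mels, sample_rate)
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    windowed = [
        x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(samples, sample_rate, n_fft=256, hop=128, n_mels=20):
    """Return a list of frames, each holding n_mels log Mel-band energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        frames.append([
            math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
            for row in bank
        ])
    return frames
```

Calling `mel_spectrogram(samples, 8000)` on a list of float samples returns one row of 20 log energies per frame. For a pure tone, the strongest band in each frame should be one whose triangular filter covers the tone's frequency, which is the kind of discrete pattern the emotion classifier and the speech recognizer consume.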
