Project Title: Personalization of Audio-Visual Fused Prosodic Patterns (音视融合的韵律模式的个性化研究)
Project No.: 60805008
Project Type: Young Scientists Fund Project
Year Approved: 2009
Discipline: Metallurgy and Metal Technology (金属学与金属工艺)
Principal Investigator: Zhiyong Wu (吴志勇)
Affiliation: Tsinghua University (清华大学)
Funding: CNY 200,000 (20万元)
Chinese Abstract (translated): This project analyzes the prosodic patterns of speakers' spoken expression, examining how audio-visual features vary over time and interact; it analyzes differences in prosodic patterns across speakers to build personalized models; and it studies generation algorithms for personalized prosodic patterns to realize personalized virtual talking avatars. For personalized audio-visual data, corpus materials were designed via semantic-situation descriptions, and a multi-situation database of 3 female and 2 male speakers was collected; SSML was adopted to establish a unified annotation scheme, extended with tag formats for prosodic structure and expressivity. For prosodic-pattern modeling, the syllables of a prosodic word were divided into four classes according to their position relative to the "core syllable", their acoustic differences were analyzed, and a nonlinear prosody superposition model oriented toward prosodic-pattern generation was proposed; a method combining global and local features based on prosodic structure was also proposed to achieve hierarchical prosody analysis and modeling. For personalized prosody modeling and generation, pitch variation across speakers was studied, a pitch pattern that reflects speaker characteristics was proposed together with a parametric method for describing a speaker's pitch characteristics, and a double-layer nonlinear superposition model was then proposed for personalized prosodic-pattern modeling and generation. For personalized expressive facial-image generation, a semantic-dimension method was proposed in which semantic features describe a speaker's personalized facial-expression characteristics; text, speech, and facial images were integrated on the basis of these semantic features; and an FAP-driven facial-expression generation method was applied to photographs of specific speakers, yielding a personalized facial-expression generation algorithm. Finally, a personalized virtual talking avatar system was constructed.
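The abstract describes the double-layer nonlinear superposition model only at a high level. Purely as an illustrative sketch, and not the project's actual formulation, the Python fragment below shows one way such a model could be structured: a local perturbation chosen by a syllable's class relative to the core syllable, plus a global speaker-style layer for pitch register and range, superimposed in the log-F0 domain. All class names, parameter values, and the log-domain form are assumptions.

```python
# Hypothetical sketch of a "double-layer nonlinear superposition" over a neutral
# F0 contour. The real model's equations and parameters are not given in the
# abstract; everything below is assumed for illustration only.

import numpy as np

# Layer 1 (local): per-syllable perturbation by the syllable's class relative to
# the "core syllable" of the prosodic word (four assumed classes).
LOCAL_PERTURBATION = {          # (gain on log-F0 excursion, shift in log-F0)
    "core":   (1.30, 0.08),
    "before": (1.05, 0.02),
    "after":  (0.95, -0.02),
    "final":  (0.85, -0.05),
}

# Layer 2 (global): speaker-level pitch style (register shift, range scaling),
# standing in for the parametric speaker description mentioned in the abstract.
SPEAKER_STYLE = {
    "speaker_A": {"register": 0.10, "range": 1.20},
    "speaker_B": {"register": -0.05, "range": 0.90},
}

def personalize_f0(neutral_f0_hz, syllable_class, speaker):
    """Map a neutral per-frame F0 contour (Hz) of one syllable to a
    personalized, expressive contour by superimposing both layers in log-F0."""
    logf0 = np.log(np.asarray(neutral_f0_hz, dtype=float))
    mean = logf0.mean()

    # local layer: expand/compress the contour around its mean, then shift it
    gain, shift = LOCAL_PERTURBATION[syllable_class]
    logf0 = mean + gain * (logf0 - mean) + shift

    # global layer: apply the speaker's range and register on top
    style = SPEAKER_STYLE[speaker]
    logf0 = mean + style["range"] * (logf0 - mean) + style["register"]

    return np.exp(logf0)

if __name__ == "__main__":
    neutral = [200.0, 210.0, 220.0, 215.0, 205.0]   # toy neutral F0 contour (Hz)
    print(personalize_f0(neutral, "core", "speaker_A"))
```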
Chinese Keywords (translated): prosodic pattern; personalized modeling; audio-visual fusion; visual prosody; visual speech synthesis
English Abstract: Prosodic patterns refer to the evolution of audio and visual bimodal correlates over time in human speech. The objective of this project is to develop a personified expressive talking avatar that expresses audio-visual speech with personalized prosodic patterns. The major research outcomes include the following. For the audio-visual bimodal corpus: 10 passages under different situations were designed for typical emotional categories (e.g., exuberant, relaxed, disgusted, angry, sad); in each passage an emotionally unbiased sentence was embedded. An audio-visual corpus of 3 female and 2 male speakers reading the passages was collected. SSML was adopted as the annotation framework, with extended tags for prosodic structure and expressivity. For prosodic-pattern modeling: based on the observation that a speaker tends to place more emphasis on one particular syllable in a word, we identified such a syllable as the "core syllable" and classified syllables into four classes according to their positional relation to the core syllable. We analyzed speech recordings for each of the four classes and developed a perturbation model of prosodic patterns that transforms neutral speech into expressive speech. We also developed a hierarchical framework based on prosodic structure for predicting speech prosody, in which global features are incorporated while modeling local features. For personified prosodic patterns: we proposed a method to describe the different prosodic styles of pitch contours from different speakers, based on an analysis of their contours. We further proposed a double-layer perturbation model of these prosodic styles, which is applied to generate prosodic patterns for personified speech generation. For generating facial expressions for the personified talking avatar: we proposed an approach that synthesizes facial expressions from semantic dimensions; seven semantic dimensions were defined to describe information such as emotion, attitude, and intention, and an ANN-based mapping model between semantic dimensions and facial parameters was built. An MPEG-4-based method was then proposed for personalized face morphing and expression synthesis, taking a picture of a neutral human face and a group of FAPs as input to generate facial-expression images for different speakers. Finally, a personified expressive talking avatar was developed that can express audio-visual speech with personalized prosodic patterns.
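The English abstract names an ANN mapping from seven semantic dimensions to facial parameters, followed by MPEG-4/FAP-driven morphing of a neutral photograph. As a hedged sketch of the mapping step only, the Python fragment below shows a toy MLP taking a 7-dimensional semantic description and producing a FAP vector; the project's actual network architecture, dimension names, FAP subset, and weights are not given, so everything here is assumed for demonstration.

```python
# Illustrative sketch: a tiny feed-forward mapping from 7 semantic dimensions to
# MPEG-4 FAP values. The dimension names, layer sizes, and weights below are
# placeholders standing in for a trained model, not the project's actual system.

import numpy as np

SEMANTIC_DIMS = ["emotion_valence", "emotion_arousal", "attitude",
                 "intention", "emphasis", "friendliness", "certainty"]  # hypothetical names
N_FAPS = 68  # MPEG-4 defines 68 FAPs; a real system may predict only a subset

rng = np.random.default_rng(0)
# Placeholder weights for a small 7 -> 16 -> 68 network.
W1, b1 = rng.normal(size=(16, len(SEMANTIC_DIMS))), np.zeros(16)
W2, b2 = rng.normal(size=(N_FAPS, 16)), np.zeros(N_FAPS)

def semantic_to_faps(semantic_vector):
    """Forward pass of a small MLP: semantic description -> FAP values."""
    x = np.asarray(semantic_vector, dtype=float)
    h = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ h + b2          # linear output: one value per FAP

if __name__ == "__main__":
    # e.g. a mildly happy, emphatic utterance (toy values in [-1, 1])
    faps = semantic_to_faps([0.6, 0.4, 0.2, 0.0, 0.7, 0.5, 0.3])
    print(faps.shape)  # (68,): values that would drive FAP-based face morphing
```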
English Keywords: Prosodic Pattern; Personalized Modeling; Audio-Visual Bimodal Modeling; Visual Prosody; Text-to-Audio-Visual-Speech Synthesis