Project Title: Personalization of Audio-Visual Fused Prosodic Patterns (音视融合的韵律模式的个性化研究)
Project No.: 60805008
Project Type: Young Scientists Fund Project
Year Approved: 2009
Discipline: Metallurgy and Metal Technology (金属学与金属工艺)
Principal Investigator: Zhiyong Wu (吴志勇)
Affiliation: Tsinghua University (清华大学)
Funding: CNY 200,000 (20万元)
Chinese Abstract (translated): This project analyzes the prosodic patterns of speakers' spoken expression, examining how audio-visual features vary over time and interact; it analyzes differences in prosodic patterns across speakers to build personalized models; and it studies generation algorithms for personalized prosodic patterns to realize personalized virtual talking avatars. For personalized audio-visual data, corpus materials were designed via semantic-situation descriptions, and a multi-situation database of 3 female and 2 male speakers was collected; SSML was adopted to establish a unified annotation scheme, extended with tag formats for prosodic structure and expressivity. For prosodic-pattern modeling, the syllables of a prosodic word were divided into four classes according to their position relative to the "core syllable", their acoustic differences were analyzed, and a nonlinear prosody superposition model oriented toward prosodic-pattern generation was proposed; a method combining global and local features based on prosodic structure was also proposed to achieve hierarchical prosody analysis and modeling. For personalized prosody modeling and generation, pitch variation across speakers was studied, a pitch pattern that reflects speaker characteristics was proposed together with a parametric method for describing a speaker's pitch characteristics, and a double-layer nonlinear superposition model was then proposed for personalized prosodic-pattern modeling and generation. For personalized expressive facial-image generation, a semantic-dimension method was proposed in which semantic features describe a speaker's personalized facial-expression characteristics; text, speech, and facial images were integrated on the basis of these semantic features; and an FAP-driven facial-expression generation method was applied to photographs of specific speakers, yielding a personalized facial-expression generation algorithm. Finally, a personalized virtual talking avatar system was constructed.
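The abstract describes the double-layer nonlinear superposition model only at a high level. Purely as an illustrative sketch, and not the project's actual formulation, the Python fragment below shows one way such a model could be structured: a local perturbation chosen by a syllable's class relative to the core syllable, plus a global speaker-style layer for pitch register and range, superimposed in the log-F0 domain. All class names, parameter values, and the log-domain form are assumptions.

```python
# Hypothetical sketch of a "double-layer nonlinear superposition" over a neutral
# F0 contour. The real model's equations and parameters are not given in the
# abstract; everything below is assumed for illustration only.

import numpy as np

# Layer 1 (local): per-syllable perturbation by the syllable's class relative to
# the "core syllable" of the prosodic word (four assumed classes).
LOCAL_PERTURBATION = {          # (gain on log-F0 excursion, shift in log-F0)
    "core":   (1.30, 0.08),
    "before": (1.05, 0.02),
    "after":  (0.95, -0.02),
    "final":  (0.85, -0.05),
}

# Layer 2 (global): speaker-level pitch style (register shift, range scaling),
# standing in for the parametric speaker description mentioned in the abstract.
SPEAKER_STYLE = {
    "speaker_A": {"register": 0.10, "range": 1.20},
    "speaker_B": {"register": -0.05, "range": 0.90},
}

def personalize_f0(neutral_f0_hz, syllable_class, speaker):
    """Map a neutral per-frame F0 contour (Hz) of one syllable to a
    personalized, expressive contour by superimposing both layers in log-F0."""
    logf0 = np.log(np.asarray(neutral_f0_hz, dtype=float))
    mean = logf0.mean()

    # local layer: expand/compress the contour around its mean, then shift it
    gain, shift = LOCAL_PERTURBATION[syllable_class]
    logf0 = mean + gain * (logf0 - mean) + shift

    # global layer: apply the speaker's range and register on top
    style = SPEAKER_STYLE[speaker]
    logf0 = mean + style["range"] * (logf0 - mean) + style["register"]

    return np.exp(logf0)

if __name__ == "__main__":
    neutral = [200.0, 210.0, 220.0, 215.0, 205.0]   # toy neutral F0 contour (Hz)
    print(personalize_f0(neutral, "core", "speaker_A"))
```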
Chinese Keywords (translated): prosodic pattern; personalized modeling; audio-visual fusion; visual prosody; visual speech synthesis
English Abstract: Prosodic patterns refer to the evolution of audio and visual bimodal correlates over time in human speech. The objective of this project is to develop a personified expressive talking avatar that expresses audio-visual speech with personalized prosodic patterns. The major research outcomes include the following. For the audio-visual bimodal corpus: 10 passages under different situations were designed for typical emotional categories (e.g., exuberant, relaxed, disgusted, angry, sad); in each passage an emotionally unbiased sentence was embedded. An audio-visual corpus of 3 female and 2 male speakers reading the passages was collected. SSML was adopted as the annotation framework, with extended tags for prosodic structure and expressivity. For prosodic-pattern modeling: based on the observation that a speaker tends to place more emphasis on one particular syllable in a word, we identified such a syllable as the "core syllable" and classified syllables into four classes according to their positional relation to the core syllable. We analyzed speech recordings for each of the four classes and developed a perturbation model of prosodic patterns that transforms neutral speech into expressive speech. We also developed a hierarchical framework based on prosodic structure for predicting speech prosody, in which global features are incorporated while modeling local features. For personified prosodic patterns: we proposed a method to describe the different prosodic styles of pitch contours from different speakers, based on an analysis of their contours. We further proposed a double-layer perturbation model of these prosodic styles, which is applied to generate prosodic patterns for personified speech generation. For generating facial expressions for the personified talking avatar: we proposed an approach that synthesizes facial expressions from semantic dimensions; seven semantic dimensions were defined to describe information such as emotion, attitude, and intention, and an ANN-based mapping model between semantic dimensions and facial parameters was built. An MPEG-4-based method was then proposed for personalized face morphing and expression synthesis, taking a picture of a neutral human face and a group of FAPs as input to generate facial-expression images for different speakers. Finally, a personified expressive talking avatar was developed that can express audio-visual speech with personalized prosodic patterns.
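The English abstract names an ANN mapping from seven semantic dimensions to facial parameters, followed by MPEG-4/FAP-driven morphing of a neutral photograph. As a hedged sketch of the mapping step only, the Python fragment below shows a toy MLP taking a 7-dimensional semantic description and producing a FAP vector; the project's actual network architecture, dimension names, FAP subset, and weights are not given, so everything here is assumed for demonstration.

```python
# Illustrative sketch: a tiny feed-forward mapping from 7 semantic dimensions to
# MPEG-4 FAP values. The dimension names, layer sizes, and weights below are
# placeholders standing in for a trained model, not the project's actual system.

import numpy as np

SEMANTIC_DIMS = ["emotion_valence", "emotion_arousal", "attitude",
                 "intention", "emphasis", "friendliness", "certainty"]  # hypothetical names
N_FAPS = 68  # MPEG-4 defines 68 FAPs; a real system may predict only a subset

rng = np.random.default_rng(0)
# Placeholder weights for a small 7 -> 16 -> 68 network.
W1, b1 = rng.normal(size=(16, len(SEMANTIC_DIMS))), np.zeros(16)
W2, b2 = rng.normal(size=(N_FAPS, 16)), np.zeros(N_FAPS)

def semantic_to_faps(semantic_vector):
    """Forward pass of a small MLP: semantic description -> FAP values."""
    x = np.asarray(semantic_vector, dtype=float)
    h = np.tanh(W1 @ x + b1)   # hidden layer
    return W2 @ h + b2          # linear output: one value per FAP

if __name__ == "__main__":
    # e.g. a mildly happy, emphatic utterance (toy values in [-1, 1])
    faps = semantic_to_faps([0.6, 0.4, 0.2, 0.0, 0.7, 0.5, 0.3])
    print(faps.shape)  # (68,): values that would drive FAP-based face morphing
```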
English Keywords: Prosodic Pattern; Personalized Modeling; Audio-Visual Bimodal Modeling; Visual Prosody; Text-to-Audio-Visual-Speech Synthesis