We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction is the task of extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, which makes it challenging to represent a target concept directly. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the segments identified as matching the concept. We compared ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
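The segment-identification idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each speech segment and the concept specifier have already been mapped into the shared embedding space, that cosine similarity is the comparison measure, and that a fixed threshold decides which segments match. None of these choices is stated in the abstract. The function names, the threshold, and the toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_concept_segments(
    segment_embeddings: Sequence[Sequence[float]],
    concept_embedding: Sequence[float],
    threshold: float = 0.5,  # assumed value, chosen only for illustration
) -> List[int]:
    """Return indices of segments whose embedding is close to the concept."""
    return [
        i
        for i, seg in enumerate(segment_embeddings)
        if cosine_similarity(seg, concept_embedding) >= threshold
    ]


# Toy, hand-written embeddings (hypothetical). In practice these would come
# from image and speech encoders trained with deep metric learning.
concept = [1.0, 0.0, 0.0]
segments = [
    [0.9, 0.1, 0.0],   # close to the concept
    [0.0, 1.0, 0.0],   # orthogonal to the concept
    [0.7, 0.0, 0.7],   # partially related
    [-1.0, 0.0, 0.0],  # opposite direction
]
selected = select_concept_segments(segments, concept, threshold=0.5)
print(selected)
```

The sketch covers only the matching step. Extracting the target speech from the acoustic characteristics of the selected segments is a separate stage that is not shown here.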