Separating a song into vocal and accompaniment components is an active research topic, and recent years have seen performance gains from supervised training with deep learning techniques. We propose to exploit the visual information corresponding to the singers' vocal activities to further improve the quality of the separated vocal signals. A video frontend model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework. To help the network learn the audiovisual correlation of singing activities, we add extra vocal signals irrelevant to the mouth movement to the audio mixture during training. We create two audiovisual singing performance datasets, one curated from audition recordings on the Internet for training and the other recorded in house for evaluation. The proposed method outperforms audio-based methods in separation quality on most test recordings. The advantage is especially pronounced when the accompaniment contains backing vocals, which pose a great challenge for audio-only methods.
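To make the fusion step concrete, below is a minimal PyTorch sketch of one plausible realization: a video frontend that encodes mouth-region frames into per-frame embeddings, and a fusion module that concatenates them with the audio embeddings of a separation network. The module names (VideoFrontend, AudioVisualFusion), layer choices, and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoFrontend(nn.Module):
    """Encodes a sequence of grayscale mouth crops into per-frame embeddings."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # 3D convolution over (time, height, width); dimensions are assumptions.
        self.conv = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):                    # frames: (B, 1, T, H, W)
        x = torch.relu(self.conv(frames))
        x = self.pool(x).squeeze(-1).squeeze(-1)  # (B, 64, T)
        return self.proj(x.transpose(1, 2))       # (B, T, embed_dim)

class AudioVisualFusion(nn.Module):
    """Fuses video embeddings into the audio feature embeddings by
    time-aligned concatenation followed by a linear projection."""
    def __init__(self, audio_dim=512, video_dim=256):
        super().__init__()
        self.merge = nn.Linear(audio_dim + video_dim, audio_dim)

    def forward(self, audio_emb, video_emb):
        # audio_emb: (B, T_a, audio_dim); upsample video to the audio rate.
        video_emb = F.interpolate(
            video_emb.transpose(1, 2), size=audio_emb.shape[1]
        ).transpose(1, 2)
        return self.merge(torch.cat([audio_emb, video_emb], dim=-1))
```

The fused embeddings would then replace the audio-only embeddings inside the separation backbone; concatenation is only one of several possible fusion choices.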
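The training-time augmentation can likewise be sketched in a few lines: mix in a randomly chosen vocal clip that does not match the on-screen mouth movement, so the network must use the visual stream to isolate the target singer. The function name, the pool of unrelated clips, and the gain range are assumptions for illustration.

```python
import random
import torch

def augment_mixture(mixture, unrelated_vocals, gain_range=(0.3, 1.0)):
    """Adds an extra vocal signal irrelevant to the mouth movement.

    mixture: waveform tensor of shape (..., num_samples)
    unrelated_vocals: list of 1-D waveform tensors from other singers
    gain_range: assumed range for the random mixing gain
    """
    extra = random.choice(unrelated_vocals)
    extra = extra[: mixture.shape[-1]]            # crop to the mixture length
    gain = random.uniform(*gain_range)
    padded = torch.zeros_like(mixture)
    padded[..., : extra.shape[-1]] = extra        # zero-pad if the clip is short
    return mixture + gain * padded
```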