Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT from audio data alone is a notoriously difficult task because instrumental accompaniment and musical constraints degrade both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data captured by an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness. The project page is at https://n20em.github.io.
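To make the fusion idea concrete, the sketch below shows how a residual cross-attention block might combine frame-level features from two modalities, e.g., audio features from the adapted wav2vec 2.0 encoder attending to lip-video features. This is a minimal illustrative sketch in PyTorch under assumed dimensions and structure; the class name, feature sizes, and layer layout are assumptions for illustration, not the exact RCA design described in the paper.

```python
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    """Hypothetical sketch of a residual cross-attention fusion block.

    Features from a primary modality (e.g., audio) attend to features from
    an auxiliary modality (e.g., lip video or IMU); the attended context is
    added back to the primary stream through a residual connection.
    All dimensions are illustrative assumptions.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        # primary:   (batch, frames, dim) features, e.g., from wav2vec 2.0
        # auxiliary: (batch, frames, dim) features, e.g., from a video encoder
        context, _ = self.attn(query=primary, key=auxiliary, value=auxiliary)
        # Residual path: the fused stream falls back to the primary modality
        # when the auxiliary stream is uninformative (e.g., occluded lips).
        return self.norm(primary + context)


if __name__ == "__main__":
    audio_feats = torch.randn(2, 100, 512)  # placeholder audio features
    video_feats = torch.randn(2, 100, 512)  # placeholder lip-movement features
    fused = ResidualCrossAttention()(audio_feats, video_feats)
    print(fused.shape)  # torch.Size([2, 100, 512])
```

A residual formulation of this kind is a common way to let the model ignore a noisy auxiliary modality, which is consistent with the noise-robustness motivation stated in the abstract.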