The metaverse is an interactive world that combines reality and virtuality, where participants appear as virtual avatars. Anyone can hold a concert in a virtual concert hall, and users can quickly identify the real singer behind a virtual idol through singer identification. Most singer identification methods rely on frame-level features. However, besides the singer's timbre, a music frame carries other musical information, such as melodiousness, rhythm, and tonality. This musical information acts as noise when frame-level features are used to identify singers. In this paper, instead of relying on frame-level features alone, we propose to use two additional features that address this problem: a middle-level feature, which represents the music's melodiousness, rhythmic stability, and tonal stability and captures the perceptual characteristics of music; and a timbre feature, commonly used in speaker identification, which represents the singer's voice characteristics. Furthermore, we propose a convolutional recurrent neural network (CRNN) that combines the three features for singer identification. The model first fuses the frame-level and timbre features and then combines the middle-level features with the fused representation. In experiments, the proposed method achieves an average F1 score of 0.81 on the benchmark Artist20 dataset, significantly improving over related work.
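The described fusion order (frame-level and timbre features fused first, middle-level features joined afterwards) can be sketched as a minimal PyTorch model. This is an illustrative assumption, not the paper's implementation: the feature dimensions, layer sizes, and the number of singers (20, matching Artist20) are all hypothetical placeholders.

```python
import torch
import torch.nn as nn


class FusionCRNN(nn.Module):
    """Hypothetical sketch of the described three-feature fusion CRNN.

    Frame-level and timbre features (both per-frame sequences) are fused
    early by concatenation and passed through a CNN + GRU; the clip-level
    middle-level features (melodiousness, rhythmic stability, tonal
    stability) are fused late, before classification. All sizes are
    illustrative assumptions.
    """

    def __init__(self, frame_dim=128, timbre_dim=40, middle_dim=3, n_singers=20):
        super().__init__()
        # CNN over the fused per-frame sequence, treated as a 1-channel "image"
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),  # halves both time and feature axes
        )
        fused_dim = frame_dim + timbre_dim  # early fusion by concatenation
        self.gru = nn.GRU(input_size=16 * (fused_dim // 2),
                          hidden_size=64, batch_first=True)
        # middle-level features are appended to the recurrent summary (late fusion)
        self.fc = nn.Linear(64 + middle_dim, n_singers)

    def forward(self, frame_feat, timbre_feat, middle_feat):
        # frame_feat: (B, T, frame_dim), timbre_feat: (B, T, timbre_dim),
        # middle_feat: (B, middle_dim); T assumed even for the pooling step
        x = torch.cat([frame_feat, timbre_feat], dim=-1)  # (B, T, F)
        x = self.conv(x.unsqueeze(1))                     # (B, 16, T/2, F/2)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # (B, T/2, 16 * F/2)
        _, h = self.gru(x)                                # final hidden state
        x = torch.cat([h[-1], middle_feat], dim=-1)       # late fusion
        return self.fc(x)                                 # (B, n_singers) logits
```

A forward pass on random tensors of shape (batch, frames, dim) yields one logit per candidate singer, which would then be trained with a standard cross-entropy loss.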