Singing Voice Detection (SVD) has been an active area of research in music information retrieval (MIR). Two deep neural network-based methods currently exist in the literature, one based on a CNN and the other on an RNN, that learn optimized features for the voice detection (VD) task and achieve state-of-the-art performance on common datasets. Both models have a large number of parameters (1.4M for the CNN and 65.7K for the RNN) and are hence not suitable for deployment on devices such as smartphones or embedded sensors with limited memory and computational power. In the deep learning literature, the most popular approach to this issue, alongside model compression, is knowledge distillation, in which a large pre-trained network, known as the teacher, is used to train a smaller student network. Despite the wide applications of SVD in MIR, to the best of our knowledge, model compression for practical deployment has not yet been explored. In this paper, we investigate this issue using both conventional and ensemble knowledge distillation techniques.
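As a rough illustration of the conventional knowledge distillation objective mentioned above, the following PyTorch-style sketch combines a hard-label cross-entropy term with a temperature-softened KL divergence term between the teacher and student outputs; the temperature and weighting values are illustrative placeholders and do not reflect the configuration used in this work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Conventional knowledge distillation loss (sketch): a weighted sum of
    the cross-entropy with ground-truth labels and the KL divergence between
    temperature-softened teacher and student output distributions."""
    # Hard-label loss against the ground truth (e.g. voice / no-voice frames).
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-label loss: match the teacher's softened output distribution.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * (temperature ** 2)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```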