With the recent growth of remote work, online meetings often encounter challenging audio contexts such as background noise, music, and echo. Accurate real-time detection of music events can help improve the user experience. In this paper, we present MusicNet, a compact neural model for detecting background music in the real-time communications pipeline. In video meetings, music frequently co-occurs with speech and background noise, making accurate classification quite challenging. We propose a compact convolutional neural network core preceded by an in-model featurization layer. MusicNet takes 9 seconds of raw audio as input and does not require any model-specific featurization in the product stack. We train our model on the balanced subset of the Audio Set~\cite{gemmeke2017audio} data and validate it on 1000 crowd-sourced real test clips. Finally, we compare MusicNet's performance with 20 state-of-the-art models. MusicNet achieves a true positive rate (TPR) of 81.3% at a 0.1% false positive rate (FPR), which is significantly better than the state-of-the-art models included in our study. MusicNet is also 10x smaller and has 4x faster inference than the best-performing models we benchmarked.
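The central design point is that featurization happens inside the model, so the production pipeline can feed raw audio directly. Below is a minimal sketch of that idea, assuming PyTorch/torchaudio and a log-mel front end; the 16 kHz sample rate, mel parameters, and all layer sizes are illustrative placeholders, since the abstract specifies only the 9-second raw-audio input and the compact-CNN-after-featurization structure.

\begin{verbatim}
# Hypothetical sketch: in-model featurization + compact CNN classifier.
# All hyperparameters here are assumptions, not the paper's values.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16000   # assumed sample rate
CLIP_SECONDS = 9      # the abstract's 9-second raw-audio input

class MusicNetSketch(nn.Module):
    def __init__(self, n_mels: int = 64):
        super().__init__()
        # In-model featurization: raw waveform -> log-mel spectrogram,
        # so no model-specific featurization is needed in the product stack.
        self.featurize = torchaudio.transforms.MelSpectrogram(
            sample_rate=SAMPLE_RATE, n_fft=512, hop_length=256,
            n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Compact convolutional core (layer count and widths are guesses).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, 1)  # binary music / no-music logit

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw audio
        spec = self.to_db(self.featurize(waveform)).unsqueeze(1)
        return self.head(self.cnn(spec).flatten(1))

# Usage: a batch of 9-second raw clips, no external feature extraction.
model = MusicNetSketch()
logits = model(torch.randn(2, SAMPLE_RATE * CLIP_SECONDS))
\end{verbatim}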