Audio segmentation and sound event detection are crucial topics in machine listening that aim to detect acoustic classes and their respective boundaries. They are useful for audio-content analysis, speech recognition, audio indexing, and music information retrieval. In recent years, most research articles have adopted segmentation-by-classification, a technique that divides audio into small frames and classifies each frame individually. In this paper, we present a novel approach called You Only Hear Once (YOHO), inspired by the YOLO algorithm widely adopted in computer vision. We convert the detection of acoustic boundaries into a regression problem instead of frame-based classification: separate output neurons detect the presence of an audio class and predict its start and end points. Compared with the state-of-the-art Convolutional Recurrent Neural Network, the relative improvement in F-measure for YOHO ranged from 1% to 6% across multiple datasets for audio segmentation and sound event detection. Because the output of YOHO is more end-to-end and has fewer neurons to predict, inference is at least 6 times faster than segmentation-by-classification. In addition, because this approach predicts acoustic boundaries directly, post-processing and smoothing are about 7 times faster.
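To make the regression formulation concrete, the sketch below decodes a hypothetical YOHO-style output grid into (class, onset, offset) events. The grid shape, the per-bin [presence, start, end] encoding, the bin duration, and the threshold are illustrative assumptions for this example, not the paper's exact configuration.

```python
import numpy as np

# Minimal sketch, assuming the network emits, for each time bin and each
# acoustic class, three numbers: [presence, start, end], where start/end
# are offsets within the bin expressed as fractions of the bin length.
# All shapes and values below are illustrative, not the paper's settings.

def decode_yoho_output(grid, bin_duration=0.5, threshold=0.5):
    """Convert a (num_bins, num_classes, 3) grid into (class, onset, offset) events in seconds."""
    events = []
    num_bins, num_classes, _ = grid.shape
    for b in range(num_bins):
        for c in range(num_classes):
            presence, rel_start, rel_end = grid[b, c]
            if presence >= threshold:
                onset = (b + rel_start) * bin_duration
                offset = (b + rel_end) * bin_duration
                events.append((c, onset, offset))
    return events

# Toy example: one event of class 0 predicted in bin 2,
# starting 10% and ending 80% of the way into that bin.
grid = np.zeros((4, 2, 3))
grid[2, 0] = [0.9, 0.1, 0.8]
print(decode_yoho_output(grid))  # [(0, 1.05, 1.4)]
```

Because each positive prediction already carries its start and end points, decoding reduces to a threshold and a scale, which is why post-processing is lighter than the frame-wise smoothing required by segmentation-by-classification.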