The development of audio event recognition systems requires labeled training data, which are generally hard to obtain. One promising source of recordings of audio events is the large amount of multimedia data on the web. In particular, if audio content analysis must itself be performed on web audio, it is important to train the recognizers on such data. Training from web data, however, poses several challenges, the most important being the availability of labels: the labels, if any, that can be obtained for these data are generally weak, and not of the kind conventionally required for training detectors or classifiers. We propose that learning algorithms able to exploit weak labels offer an effective way to learn from web data. We then propose a robust and efficient deep convolutional neural network (CNN) based framework to learn audio event recognizers from weakly labeled data. The proposed method can train on and analyze recordings of variable length in an efficient manner, and it outperforms a network trained with strongly labeled web data by a considerable margin. Moreover, even though we learn from weakly labeled data, in which event time stamps within a recording are not available during training, our proposed framework is able to localize events during the inference stage.
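To make the weak-label idea concrete, the sketch below shows one common realization of this kind of framework: a CNN produces per-segment class scores over time, which are max-pooled into a single recording-level prediction, so the loss needs only clip-level (weak) labels, while the per-segment scores provide the temporal localization available at inference. This is a minimal illustrative sketch; the layer sizes, the max-pooling aggregation, and the log-mel input settings are assumptions, not the paper's exact architecture.

```python
# Minimal sketch of weakly-supervised audio event recognition with
# temporal max-pooling aggregation (multiple-instance-learning style).
# Layer sizes and feature settings are illustrative assumptions.
import torch
import torch.nn as nn

class WeakLabelCNN(nn.Module):
    def __init__(self, n_mels: int = 64, n_classes: int = 10):
        super().__init__()
        # Convolutional trunk over a log-mel spectrogram of shape
        # (batch, 1, n_mels, time); the time axis may vary per recording.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # A 1x1-in-time convolution acts as a per-segment classifier,
        # emitting one score per class at each downsampled time step.
        self.classifier = nn.Conv2d(64, n_classes, kernel_size=(n_mels // 4, 1))

    def forward(self, x: torch.Tensor):
        h = self.features(x)                     # (B, 64, n_mels/4, T/4)
        seg = torch.sigmoid(self.classifier(h))  # (B, C, 1, T/4)
        seg = seg.squeeze(2)                     # per-segment scores (B, C, T')
        clip = seg.max(dim=-1).values            # recording-level scores (B, C)
        return clip, seg

model = WeakLabelCNN()
spec = torch.randn(4, 1, 64, 500)                    # a batch of spectrograms
weak_labels = torch.randint(0, 2, (4, 10)).float()   # presence/absence only
clip_scores, segment_scores = model(spec)
loss = nn.functional.binary_cross_entropy(clip_scores, weak_labels)
loss.backward()
# At inference, segment_scores gives a temporal activation map per class,
# which is what permits event localization despite training without
# event time stamps.
```

Because the aggregation (here, a max over time) is the only link between segment scores and the weak clip label, the network is pushed to assign high segment scores where the event actually occurs, which is the mechanism behind localization from weak labels.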