Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification

Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal was recorded. Log-mel features and convolutional neural networks (CNNs) have recently become the most popular time-frequency (TF) feature representation and classifier, respectively, in ASC. An audio signal recorded in a scene may include various sounds that overlap in time and frequency. A previous study suggests that considering long-duration sounds and short-duration sounds separately in the CNN may improve ASC accuracy. This study addresses the generalization ability of acoustic scene classifiers. In practice, the characteristics of acoustic scene signals may be affected by various factors, such as the choice of recording device and changes in recording location. When an established ASC system predicts scene classes for audio recorded under unseen conditions, its accuracy may drop significantly. Long-duration sounds contain not only domain-independent acoustic scene information but also channel information determined by the recording conditions, and a classifier is prone to over-fitting to the latter. To obtain a more robust ASC system, we propose a robust feature learning (RFL) framework for training the CNN. The RFL framework down-weights CNN learning specifically on long-duration sounds. It does so by training an auxiliary classifier that takes only long-duration sound information as input. The auxiliary classifier is trained with an auxiliary loss function that assigns less learning weight to poorly classified examples than the standard cross-entropy loss does. The experimental results show that the proposed RFL framework yields an acoustic scene classifier that is more robust to unseen devices and cities.
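
The abstract does not give the form of the auxiliary loss, so the following is only an illustrative sketch of the general idea: a loss that gives poorly classified examples less learning weight than cross-entropy does. As a stand-in it uses the generalized cross-entropy (L_q) loss, (1 - p_t^q) / q, whose gradient with respect to the true-class probability p_t is the cross-entropy gradient scaled by p_t^q. The choice of this loss, the function names, and the default q = 0.5 are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy: -log p_t, where p_t is the true-class probability."""
    p_t = softmax(logits)[target]
    return -math.log(p_t)


def lq_loss(logits, target, q=0.5):
    """Stand-in auxiliary loss (assumption): generalized cross-entropy (1 - p_t**q) / q.

    q must lie in (0, 1]. As q approaches 0 the loss approaches cross-entropy;
    larger q down-weights low-confidence (poorly classified) examples more.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    p_t = softmax(logits)[target]
    return (1.0 - p_t ** q) / q


def relative_gradient_weight(p_t, q=0.5):
    """Ratio of the L_q gradient to the cross-entropy gradient w.r.t. p_t.

    d/dp_t of (1 - p_t**q) / q is -p_t**(q - 1); d/dp_t of -log p_t is -1 / p_t.
    Their ratio is p_t**q, which shrinks as the example gets harder.
    """
    return p_t ** q


if __name__ == "__main__":
    # Hypothetical auxiliary-classifier scores for one clip; class 0 is the true scene.
    well_classified = [3.0, -1.0, -1.0]
    poorly_classified = [-3.0, 1.0, 1.0]
    for name, logits in [("well", well_classified), ("poorly", poorly_classified)]:
        p_t = softmax(logits)[0]
        print(name, "CE:", cross_entropy(logits, 0), "L_q:", lq_loss(logits, 0),
              "relative weight:", relative_gradient_weight(p_t))
```

In this sketch the poorly classified clip receives a much smaller relative weight than the well-classified one, which matches the qualitative behaviour described for the auxiliary loss. How the auxiliary loss is combined with the main classifier's loss is not stated in the abstract and is left out here.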
