In the acoustic scene classification (ASC) task, an acoustic scene consists of diverse sounds and is inferred by identifying combinations of distinct attributes among them. This study aims to extract and cluster these attributes effectively using an improved multiple-instance learning (MIL) framework for ASC. MIL, a weakly supervised learning method, is a strategy that extracts instances from the bundle of frames composing an input audio clip and infers the scene corresponding to the input using these unlabeled instances. However, many studies have pointed out an underestimation problem in MIL. In this study, we develop a MIL framework better suited to ASC systems by defining instance-level labels and an instance-level loss so that instances are extracted and clustered effectively. Furthermore, we design a fully separated convolutional module, a lightweight neural network comprising pointwise, frequency-sided depthwise, and temporal-sided depthwise convolutional filters. As a result, compared to vanilla MIL, the confidence and proportion of positive instances increase significantly, overcoming the underestimation problem and improving the classification accuracy by up to 11%. The proposed system achieves accuracies of 81.1% and 72.3% on the TAU Urban Acoustic Scenes 2019 and 2020 Mobile datasets, respectively, with 139 K parameters. In particular, it achieves the highest performance among systems with fewer than 1 M parameters on the TAU Urban Acoustic Scenes 2019 dataset.
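To make the MIL idea above concrete, the following is a minimal, hypothetical PyTorch sketch of a frame-level (instance-level) classification head with an added instance-level loss. The label-assignment rule (broadcasting the bag label to every frame), the mean pooling, and the loss weighting are assumptions for illustration only; the abstract does not specify the paper's exact definitions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Treats each frame embedding of a clip as an MIL instance."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.instance_classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, embed_dim) -- one embedding per frame (instance)
        inst_logits = self.instance_classifier(frames)  # (B, T, C)
        bag_logits = inst_logits.mean(dim=1)            # simple mean pooling over instances
        return inst_logits, bag_logits

def mil_loss(inst_logits, bag_logits, bag_labels, inst_weight=0.5):
    # Bag-level loss: standard cross-entropy against the clip (bag) label.
    bag_loss = F.cross_entropy(bag_logits, bag_labels)
    # Instance-level loss (assumed rule): broadcast the bag label to every frame,
    # encouraging confident positive instances to counteract underestimation.
    B, T, C = inst_logits.shape
    inst_labels = bag_labels.unsqueeze(1).expand(B, T).reshape(-1)
    inst_loss = F.cross_entropy(inst_logits.reshape(-1, C), inst_labels)
    return bag_loss + inst_weight * inst_loss
```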
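The fully separated convolutional module can likewise be sketched. The factorization of a 2-D convolution over a (frequency, time) spectrogram into pointwise, frequency-sided depthwise, and temporal-sided depthwise filters is taken from the abstract; the kernel size of 3, the ordering of the three filters, and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class FullySeparatedConv(nn.Module):
    """Factored replacement for a k x k 2-D convolution on spectrogram input."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Pointwise (1x1) convolution mixes channels without touching time/frequency.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        # Depthwise convolution along the frequency axis only (kernel 3x1).
        self.freq_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1),
                                 padding=(1, 0), groups=out_ch)
        # Depthwise convolution along the time axis only (kernel 1x3).
        self.time_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3),
                                 padding=(0, 1), groups=out_ch)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, frequency, time), e.g. a log-mel spectrogram
        x = self.pointwise(x)
        x = self.freq_dw(x)
        x = self.time_dw(x)
        return self.act(self.bn(x))
```

Under these assumptions, the weight count drops from roughly k*k*Cin*Cout for a full k x k convolution to Cin*Cout + 2*k*Cout, which is consistent with the lightweight 139 K parameter budget reported above.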