Acoustic scene classification is a machine listening problem that aims to assign an audio recording to one of a set of pre-defined scenes based on its audio content. Over the years (and in past editions of the DCASE challenge), this problem has often been tackled with ensembles, i.e. several machine learning models whose predictions are combined at inference time. While such solutions can achieve high accuracy, they can be very expensive computationally, making them impossible to deploy on IoT devices. In response to this trend, the task imposes two constraints on model complexity. There is also the added difficulty of device mismatch: the provided recordings come from different capture devices. This technical report presents a comparative study of two network architectures: a conventional CNN and a Conv-mixer. Although both networks exceed the baseline required by the challenge, the conventional CNN shows higher performance, exceeding the baseline by 8 percentage points. Conv-mixer-based solutions perform worse, although they are much lighter.
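To make the architectural comparison concrete, the following is a minimal numpy sketch of a ConvMixer-style block in the standard form (depthwise convolution with a residual connection, followed by a pointwise 1x1 convolution); BatchNorm is omitted for brevity, and all function names and shapes are illustrative, not the report's actual implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation used in ConvMixer
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv(x, kernels):
    # x: (C, H, W); kernels: (C, k, k); one kernel per channel, 'same' padding
    C, H, W = x.shape
    k = kernels.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out

def pointwise_conv(x, weight):
    # weight: (C_out, C_in); a 1x1 convolution mixes information across channels
    return np.tensordot(weight, x, axes=([1], [0]))

def convmixer_block(x, dw_kernels, pw_weight):
    # depthwise conv (spatial mixing) + residual, then pointwise conv (channel mixing)
    y = x + gelu(depthwise_conv(x, dw_kernels))
    return gelu(pointwise_conv(y, pw_weight))
```

The depthwise stage keeps the parameter count low (one small kernel per channel instead of a full C-by-C filter bank), which is why Conv-mixer models come out much lighter than a conventional CNN of similar depth.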