Acoustic events are sounds with well-defined spectro-temporal characteristics that can be associated with the physical objects generating them. Acoustic scenes are collections of such acoustic events in no specific temporal order. Given this natural linkage between events and scenes, a common belief is that the ability to classify events must help in the classification of scenes. This has led to several efforts to perform well on both Acoustic Event Tagging (AET) and Acoustic Scene Classification (ASC) using a multi-task network. However, in these efforts, improvement in one task does not guarantee improvement in the other, suggesting a tension between ASC and AET. It is unclear whether improvements in AET translate to improvements in ASC. We explore this conundrum through an extensive empirical study and show that, under certain conditions, using AET as an auxiliary task in the multi-task network consistently improves ASC performance. Additionally, ASC performance improves further as the AET dataset grows, and it is not sensitive to the choice of events or the number of events in the AET dataset. We conclude that this improvement in ASC performance comes from the regularization effect of using AET and not from the network's improved ability to discern between acoustic events.
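To make the idea of using AET as an auxiliary task concrete, the following is a minimal sketch of a combined multi-task objective. It is an illustration under stated assumptions and not the loss used in the study: the abstract does not specify the loss functions, so the sketch assumes softmax cross-entropy for ASC (one scene label per clip), mean sigmoid binary cross-entropy for AET (several event tags per clip), and a hypothetical weighting parameter `aet_weight`. All function names are invented for this example.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for single-label scene classification (assumed ASC loss)."""
    m = max(logits)
    # log-sum-exp computed with the max subtracted for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sigmoid_binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy for multi-label event tagging (assumed AET loss)."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(z)
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multitask_loss(asc_logits, scene_label, aet_logits, event_tags, aet_weight=0.5):
    """ASC loss plus a weighted auxiliary AET loss.

    `aet_weight` is a hypothetical hyperparameter; setting it to 0 recovers
    single-task ASC training.
    """
    asc = softmax_cross_entropy(asc_logits, scene_label)
    aet = sigmoid_binary_cross_entropy(aet_logits, event_tags)
    return asc + aet_weight * aet


# Example: 3 candidate scenes, 4 candidate event tags for one clip.
loss = multitask_loss(
    asc_logits=[2.0, 0.5, -1.0],
    scene_label=0,
    aet_logits=[1.5, -0.5, 0.0, -2.0],
    event_tags=[1, 0, 0, 0],
)
```

In a sketch like this, the auxiliary term only shapes the shared representation during training; at test time only the ASC output would be used. That is consistent with the abstract's reading of AET as a regularizer, though the weighting scheme here is an assumption.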