Weakly labelled audio tagging aims to predict the classes of sound events within an audio clip, where the onset and offset times of the sound events are not provided. Previous works have adopted the multiple instance learning (MIL) framework and exploited information from the whole audio clip via MIL pooling functions. However, detailed information about sound events, such as their durations, may not be captured under this framework. To address this issue, we propose a novel two-stream framework for audio tagging that exploits both the global and local information of sound events. The global stream analyzes the whole audio clip in order to identify, via a class-wise selection module, the local clips that should be attended to. These clips are then fed to the local stream, which exploits their detailed information for a better decision. Experimental results on AudioSet show that our proposed method can significantly improve audio tagging performance across different baseline network architectures.
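The two-stream pipeline described above can be sketched in plain Python. This is a minimal illustration under assumed interfaces, not the authors' implementation: `select_segments`, `two_stream_tag`, the top-`k` selection rule, the max-pooled global clip score, and the equal-weight fusion are all illustrative choices. In a real system, `global_scores` and `local_scores` would come from neural networks over spectrogram features.

```python
def select_segments(global_scores, k):
    """Class-wise selection module (sketch): for each class, return the
    indices of the k segments with the highest global-stream scores.

    global_scores: list of per-segment score lists, one entry per class.
    """
    num_segments = len(global_scores)
    num_classes = len(global_scores[0])
    selected = []
    for c in range(num_classes):
        order = sorted(range(num_segments),
                       key=lambda i: global_scores[i][c], reverse=True)
        selected.append(order[:k])
    return selected


def two_stream_tag(global_scores, local_scores, selected):
    """Fuse the two streams (sketch): per class, average the max-pooled
    global clip score with the mean local score of the selected segments.
    The max pooling and the 50/50 fusion weight are assumptions."""
    num_classes = len(global_scores[0])
    tags = []
    for c in range(num_classes):
        g = max(seg[c] for seg in global_scores)  # clip-level global score
        l = sum(local_scores[i][c] for i in selected[c]) / len(selected[c])
        tags.append(0.5 * (g + l))
    return tags


# Toy example: 3 segments, 2 sound-event classes.
global_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]
local_scores = [[0.2, 0.95], [0.85, 0.1], [0.4, 0.3]]
selected = select_segments(global_scores, k=1)   # [[1], [0]]
tags = two_stream_tag(global_scores, local_scores, selected)
```

The key point the sketch captures is that the final decision for each class is refined by a second, local look at only the segments that the global stream flagged for that class, rather than by pooling over the entire clip alone.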