Target sound detection (TSD) aims to detect a target sound in mixture audio given reference information. Previous work shows that good detection performance relies on fully-annotated data. However, collecting fully-annotated data is labor-intensive. Therefore, we consider TSD with mixed supervision, which learns novel categories (target domain) from weak annotations with the help of full annotations of existing base categories (source domain). We propose a novel two-student learning framework, which contains two mutually helping student models ($\mathit{s\_student}$ and $\mathit{w\_student}$) that learn from fully- and weakly-annotated datasets, respectively. Specifically, we first propose a frame-level knowledge distillation strategy to transfer class-agnostic knowledge from $\mathit{s\_student}$ to $\mathit{w\_student}$. After that, a pseudo-supervised (PS) training stage is designed to transfer knowledge from $\mathit{w\_student}$ back to $\mathit{s\_student}$. Lastly, an adversarial training strategy is proposed to align the data distributions of the source and target domains. To evaluate our method, we build three TSD datasets based on UrbanSound and AudioSet. Experimental results show that our method offers about 8\% improvement in event-based F-score.
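As a rough illustration of what frame-level knowledge distillation between two students could look like, the sketch below computes a per-frame binary cross-entropy between the frame-wise presence probabilities of a teacher-side model (standing in for s_student) and those of a learner (standing in for w_student). The abstract does not specify the loss, so the choice of binary cross-entropy, the function name `frame_level_distillation_loss`, and the `eps` clipping parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def frame_level_distillation_loss(teacher_probs, student_probs, eps=1e-7):
    """Mean binary cross-entropy over frames (hypothetical helper).

    teacher_probs: per-frame target-presence probabilities from the
        teacher-side model (assumed to play the role of s_student).
    student_probs: per-frame probabilities from the learner
        (assumed to play the role of w_student).
    """
    if len(teacher_probs) != len(student_probs):
        raise ValueError("both sequences must cover the same frames")
    if not teacher_probs:
        raise ValueError("at least one frame is required")

    total = 0.0
    for t, s in zip(teacher_probs, student_probs):
        # Clip the student probability to keep the logarithms finite.
        s = min(max(s, eps), 1.0 - eps)
        total += -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))
    return total / len(teacher_probs)
```

Under this assumed loss, a learner whose frame-wise outputs agree with the teacher's soft labels receives a lower value than one whose outputs disagree, which is the behavior a distillation term is meant to encourage.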