In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, the Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model that learns event-level representations and predicts sound event categories and boundaries directly. The latter follows the widely adopted frame-classification scheme, in which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, we apply self-supervised pre-training on unlabeled data and adopt semi-supervised learning with an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly labeled and unlabeled data. For the frame-wise model, we use the ICT-TOSHIBA system of DCASE 2021 Task 4. Experimental results show that the hybrid system considerably outperforms either individual model and achieves a PSDS1 of 0.420 and a PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
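The EMA teacher update mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter layout (a plain dictionary of float lists), the function name `ema_update`, and the decay value are all assumptions chosen for illustration, and the abstract does not state the decay used.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float = 0.999) -> None:
    """Update the teacher in place as an exponential moving average of the student.

    For every weight: teacher <- decay * teacher + (1 - decay) * student.
    The default decay of 0.999 is an assumed value, not taken from the paper.
    """
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = decay * teacher_values[i] + (1.0 - decay) * s


# Toy usage: the teacher drifts slowly toward the student after each step.
student = {"w": [1.0, 2.0]}
teacher = {"w": [0.0, 0.0]}
for _ in range(3):
    ema_update(teacher, student, decay=0.9)
print(teacher)
```

With a decay close to 1, the teacher changes slowly and averages the student over many training steps, which is the usual motivation for using it to produce pseudo labels.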