Existing work on weakly supervised sound event detection (WSSED) has not jointly addressed two types of co-occurrence: some sound events often occur together, and their occurrences are usually accompanied by specific background sounds. With only clip-level supervision, these co-occurring sounds inevitably become entangled, causing misclassification and biased localization. To tackle this issue, we first establish a structural causal model (SCM), which reveals that context is the main cause of the co-occurrence confounders that mislead the model into learning spurious correlations between frames and clip-level labels. Based on this causal analysis, we propose a causal intervention (CI) method for WSSED that removes the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts onto the frame-level features to make event boundaries clearer. Experiments show that our method effectively improves performance on multiple datasets and generalizes to various baseline models.
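The accumulate-then-re-project idea can be illustrated with a minimal sketch. The abstract does not specify the update rule or the projection, so everything below is an assumption made for illustration: the function names `update_context_bank` and `reproject`, the mean-pooled clip feature, the probability-weighted moving average with a `momentum` parameter, and the softmax attention over class contexts are all hypothetical choices, not the paper's implementation.

```python
import numpy as np


def update_context_bank(bank, frame_feats, clip_probs, momentum=0.9):
    """Accumulate clip context into a per-class bank (hypothetical rule).

    bank:        (C, D) running context vector for each of C classes
    frame_feats: (T, D) frame-level features of one clip
    clip_probs:  (C,)   clip-level class probabilities in [0, 1]
    """
    # Assumption: the clip's context is summarised by mean pooling over frames.
    clip_feat = frame_feats.mean(axis=0)                 # (D,)
    # Each class moves toward the clip context in proportion to how
    # strongly the clip is believed to contain that class.
    step = (1.0 - momentum) * clip_probs[:, None]        # (C, 1)
    return (1.0 - step) * bank + step * clip_feat[None, :]


def reproject(bank, frame_feats):
    """Re-project accumulated contexts onto frame-level features.

    Assumption: each frame attends over the class contexts with a softmax
    and the attended context is added back as a residual.
    """
    logits = frame_feats @ bank.T                        # (T, C)
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)        # rows sum to 1
    return frame_feats + weights @ bank                  # (T, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 16))    # 100 frames, 16-dim features
    bank = np.zeros((10, 16))             # 10 classes
    probs = rng.uniform(size=10)          # stand-in for clip-level predictions
    bank = update_context_bank(bank, feats, probs)
    enhanced = reproject(bank, feats)
    print(enhanced.shape)
```

In a training loop, the bank would be updated once per clip (or per batch) and the re-projected features would feed the frame-level classifier; how the actual method schedules these two steps is not stated in the abstract.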