In this paper, we present end-to-end and speech-embedding-based systems trained in a self-supervised fashion for our participation in the ACM Multimedia 2022 ComParE Challenge, specifically the stuttering sub-challenge. In particular, we exploit embeddings from the pre-trained Wav2Vec2.0 model for stuttering detection (SD) on the KSoF dataset. After embedding extraction, we benchmark several classification methods for SD. Our proposed self-supervision-based SD system achieves a UAR of 36.9% and 41.0% on the validation and test sets respectively, which is 31.32% (validation set) and 1.49% (test set) higher than the best (DeepSpectrum) challenge baseline (CBL). Moreover, we show that concatenating the layer embeddings with Mel-frequency cepstral coefficient (MFCC) features further improves the UAR by 33.81% and 5.45% over the CBL on the validation and test sets respectively. Finally, we demonstrate that summing information across all the layers of Wav2Vec2.0 surpasses the CBL by a relative margin of 45.91% and 5.69% on the validation and test sets respectively. Grand-challenge: Computational Paralinguistics ChallengE
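To make the feature pipeline concrete, the sketch below illustrates the three embedding variants named above: a single-layer Wav2Vec2.0 embedding, a sum across all layers, and concatenation with MFCCs. It is a minimal illustration under stated assumptions, not the authors' exact pipeline: the checkpoint name (facebook/wav2vec2-base-960h), the choice of layer, the mean-pooling over time, and the MFCC configuration are all assumptions for illustration.

```python
# Minimal sketch of the embedding variants described in the abstract.
# Assumptions (not from the paper): checkpoint, layer index, pooling,
# and n_mfcc are placeholders chosen for illustration only.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SR = 16_000  # Wav2Vec2.0 expects 16 kHz mono audio

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained(
    "facebook/wav2vec2-base-960h", output_hidden_states=True
).eval()

waveform = torch.randn(SR * 3)  # stand-in for a 3 s speech segment

with torch.no_grad():
    inputs = extractor(waveform.numpy(), sampling_rate=SR, return_tensors="pt")
    # Tuple of (num_layers + 1) tensors, each of shape (1, T, 768)
    hidden_states = model(**inputs).hidden_states

# Variant 1: single-layer embedding, mean-pooled over time.
layer_emb = hidden_states[12].mean(dim=1)                        # (1, 768)

# Variant 2: sum information across all layers, then pool over time.
summed_emb = torch.stack(hidden_states).sum(dim=0).mean(dim=1)   # (1, 768)

# Variant 3: concatenate the layer embedding with mean-pooled MFCCs.
mfcc = torchaudio.transforms.MFCC(sample_rate=SR, n_mfcc=40)(waveform)
mfcc_emb = mfcc.mean(dim=-1).unsqueeze(0)                        # (1, 40)
combined = torch.cat([layer_emb, mfcc_emb], dim=-1)              # (1, 808)

print(layer_emb.shape, summed_emb.shape, combined.shape)
```

Any of these fixed-dimensional vectors can then be fed to a downstream SD classifier; the abstract reports results for all three variants against the DeepSpectrum baseline.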