This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media (Gender Equality Monitoring project). In this application context, we need to automatically segment the speech signal according to the speakers' gender and to identify when at least two speakers speak at the same time. We propose using the WavLM model, which has the advantage of being pre-trained on a large amount of speech data, to build an overlapped speech detection (OSD) system and a gender detection (GD) system. In this study, we use two different corpora: the DIHARD III corpus, which is well suited to the OSD task but lacks gender information, and the ALLIES corpus, which fits the project's application context. Our best OSD system is a Temporal Convolutional Network (TCN) with pre-trained WavLM features as input, which reaches a new state-of-the-art F1-score on DIHARD. A neural GD system is trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data and obtains an accuracy of 97.9%. This work opens new perspectives for human science researchers regarding the differences in representation between women and men in French media.
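The abstract defines overlap as any moment when at least two speakers talk at the same time, and it reports OSD performance as an F1-score. The following is a minimal illustrative sketch, not the authors' evaluation code. It shows how frame-level overlap labels can be derived from speaker-turn annotations and how those labels can be scored with F1. The `(start, end, speaker)` segment format, the 10 ms default frame step, and the helper names `overlap_frames` and `f1_score` are hypothetical. The paper's actual DIHARD scoring protocol (frame resolution, collars, tooling) is not stated here and may differ.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_seconds, end_seconds, speaker_id)
Segment = Tuple[float, float, str]


def overlap_frames(segments: List[Segment], duration: float, step: float = 0.01) -> List[int]:
    """Label each frame 1 if at least two *distinct* speakers are active at its centre, else 0."""
    n_frames = int(round(duration / step))
    labels = []
    for i in range(n_frames):
        t = (i + 0.5) * step  # centre of frame i
        active = {spk for start, end, spk in segments if start <= t < end}
        labels.append(1 if len(active) >= 2 else 0)
    return labels


def f1_score(ref: List[int], hyp: List[int]) -> float:
    """Frame-level F1 for the positive (overlap) class."""
    tp = sum(1 for r, h in zip(ref, hyp) if r and h)
    fp = sum(1 for r, h in zip(ref, hyp) if not r and h)
    fn = sum(1 for r, h in zip(ref, hyp) if r and not h)
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing is predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Speaker A talks 0.0-2.0 s, speaker B talks 1.5-3.0 s: they overlap between 1.5 and 2.0 s.
    segments = [(0.0, 2.0, "A"), (1.5, 3.0, "B")]
    ref = overlap_frames(segments, duration=3.0, step=0.5)
    print(ref)  # [0, 0, 0, 1, 0, 0]
    hyp = [0, 0, 1, 1, 0, 0]  # a hand-written hypothesis with one false alarm
    print(round(f1_score(ref, hyp), 3))  # 0.667
```

In a real system, the hypothesis labels would come from thresholding the OSD model's frame-level posteriors rather than being written by hand. Counting distinct speakers in a set ensures that two overlapping turns from the same speaker are not counted as overlap.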