We introduce the Merkel Podcast Corpus, a German audio-visual-text corpus collected from 16 years of (almost) weekly Internet podcasts by former German chancellor Angela Merkel. To the best of our knowledge, this is the first single-speaker corpus in German that combines audio, visual, and text modalities and is of comparable size and temporal extent. We describe how we collected and edited the data: downloading the videos, transcripts, and other metadata; performing forced alignment; and applying active speaker recognition and face detection to curate a single-speaker dataset consisting of utterances spoken by Angela Merkel. The proposed pipeline is general and can be used to curate other datasets of a similar nature, such as talk show content. We demonstrate the utility of the dataset through various statistical analyses and through applications in talking face generation and text-to-speech (TTS). We argue that it is a valuable contribution to the research community, in particular because of its realistic and challenging material at the boundary between prepared and spontaneous speech.