Voice synthesis using machine learning techniques has made rapid progress in recent years [1], and the first high-profile fraud cases have recently been reported [2]. Given the current increase in the use of conferencing tools for online teaching, we ask just how easy it would be to create a convincing voice fake in terms of the data, hardware, and skill set required. We analyse how much training data a participant (e.g. a student) would actually need to fake another participant's voice (e.g. a professor's). We provide an analysis of the existing state of the art in creating voice deep fakes, and offer detailed technical guidance and evidence of just how much effort is needed to copy a voice. A user study with more than 100 participants shows how difficult it is to tell real and fake voices apart (on average, only 37 percent can distinguish between the real and the fake voice of a professor). Focusing on the German language and an online teaching environment, we discuss the societal implications and demonstrate how machine learning techniques could be used to detect such fakes.