Synthesized speech is common today due to the prevalence of virtual assistants, easy-to-use tools for generating and modifying speech signals, and remote work practices. Synthesized speech can also be used for nefarious purposes, such as fabricating a speech signal and attributing it to someone who never spoke its content. We therefore need methods to detect whether a speech signal is synthesized. In this paper, we analyze speech signals in the form of spectrograms with a Compact Convolutional Transformer (CCT) for synthesized speech detection. A CCT uses a convolutional layer that introduces inductive biases and shared weights into the network, allowing a transformer architecture to perform well with fewer training samples. The CCT then uses an attention mechanism to incorporate information from all parts of the signal under analysis. We train the CCT on both genuine and synthesized human voice signals and demonstrate that our approach successfully differentiates between the two.
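The pipeline described above (spectrogram, convolutional tokenization, attention over all tokens, binary decision) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: every function name is hypothetical, the weights are random and untrained, and the identity query/key/value projections, mean pooling and single logistic output are simplifying assumptions made only to keep the example short.

```python
import cmath
import math
import random


def spectrogram(signal, frame_len=16, hop=8):
    """Log-magnitude spectrogram from a naive DFT (rows: frames, columns: bins)."""
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(math.log1p(abs(coeff)))
        rows.append(row)
    return rows


def conv_tokenize(spec, kernels, stride=2):
    """Slide shared square kernels over the spectrogram; each position yields one token."""
    k = len(kernels[0])
    tokens = []
    for i in range(0, len(spec) - k + 1, stride):
        for j in range(0, len(spec[0]) - k + 1, stride):
            token = []
            for kern in kernels:
                acc = sum(kern[a][b] * spec[i + a][j + b]
                          for a in range(k) for b in range(k))
                token.append(max(0.0, acc))  # ReLU
            tokens.append(token)
    return tokens


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in tokens]
        w = softmax(scores)  # every token attends to every other token
        out.append([sum(wi * v[c] for wi, v in zip(w, tokens)) for c in range(d)])
    return out


# Hypothetical, untrained parameters: four 3x3 kernels and one output weight vector.
rng = random.Random(0)
kernels = [[[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
           for _ in range(4)]
weights = [rng.uniform(-0.5, 0.5) for _ in range(4)]

# Toy "speech" signal: a sinusoid that falls exactly on DFT bin 2.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]

spec = spectrogram(signal)
tokens = conv_tokenize(spec, kernels)
attended = self_attention(tokens)
pooled = [sum(col) / len(attended) for col in zip(*attended)]  # mean pooling
logit = sum(w * p for w, p in zip(weights, pooled))
prob = 1.0 / (1.0 + math.exp(-logit))  # score for the "synthesized" class
print(len(spec), len(spec[0]), len(tokens), prob)
```

Because the weights are random, the printed score is meaningless as a prediction. The sketch only shows how the convolution reuses the same kernels at every time-frequency position, while the attention step lets each resulting token draw on every other part of the signal.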