Digital technology has made previously unimaginable applications a reality. Having a handful of tools for easy editing and manipulation is exciting, but it also raises alarming concerns: manipulated content can propagate as voice clones, duplicates, or deepfakes. Validating the authenticity of a speech recording is one of the primary problems of digital audio forensics. We propose an approach to distinguish human speech from AI-synthesized speech that exploits bispectral and cepstral analysis. Higher-order spectral statistics exhibit weaker correlations for human speech than for synthesized speech. In addition, cepstral analysis reveals a durable power component in human speech that is missing from synthesized speech. We integrate both analyses and propose a machine learning model to detect AI-synthesized speech.
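The abstract does not spell out the feature-extraction pipeline, but the two analyses it names can be sketched in a few lines of NumPy. Below is a minimal, illustrative implementation of (a) a squared-bicoherence estimator, a normalized higher-order (third-order) spectral statistic of the kind used in bispectral analysis, and (b) the real cepstrum. All function names, window sizes, and parameters here are assumptions for illustration, not the authors' actual method.

```python
import numpy as np

def bicoherence(x, nfft=128, hop=64):
    """Estimate squared bicoherence b^2(f1, f2) over overlapping windowed
    segments. Values lie in [0, 1]; strong quadratic phase coupling between
    frequencies f1, f2, and f1+f2 pushes the estimate toward 1.
    (Illustrative parameters; not the paper's configuration.)"""
    win = np.hanning(nfft)
    segs = [x[i:i + nfft] * win for i in range(0, len(x) - nfft + 1, hop)]
    f = nfft // 2
    # Index grid for the sum frequency f1 + f2 (stays below nfft).
    idx = np.add.outer(np.arange(f), np.arange(f))
    num = np.zeros((f, f), dtype=complex)
    den1 = np.zeros((f, f))
    den2 = np.zeros((f, f))
    for seg in segs:
        X = np.fft.fft(seg)
        prod = np.outer(X[:f], X[:f])          # X(f1) * X(f2)
        num += prod * np.conj(X[idx])          # triple product, averaged
        den1 += np.abs(prod) ** 2
        den2 += np.abs(X[idx]) ** 2
    return np.abs(num) ** 2 / (den1 * den2 + 1e-12)

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    Low-quefrency bins capture the spectral envelope / power structure."""
    spectrum = np.abs(np.fft.fft(frame))
    return np.fft.ifft(np.log(spectrum + 1e-12)).real
```

A simple usage pattern would be to summarize the bicoherence map (e.g. its mean magnitude and phase) and the low-quefrency cepstral coefficients per recording, then feed those summary features to a classifier such as an SVM; the specific feature set and model are design choices not fixed by the abstract.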