Audio data is growing worldwide every day with the increase in telephone conversations, video conferences, and voice messages. This research provides a mechanism for identifying a speaker in an audio file based on biometric features of the human voice, such as pitch, amplitude, and frequency. We propose an unsupervised learning model that can learn speech representations from a limited dataset. Using the LibriSpeech dataset, we achieved a word error rate of 1.8.
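To illustrate the kind of biometric features mentioned above, the following is a minimal sketch (not the paper's actual pipeline) of estimating pitch and amplitude from a voiced audio frame using only NumPy; the autocorrelation-based pitch estimator and the 50–400 Hz search range are illustrative assumptions, and a synthetic 200 Hz tone stands in for real speech.

```python
import numpy as np

def extract_features(signal, sr):
    """Return (pitch in Hz, RMS amplitude) for a mono audio frame.

    Pitch is estimated by finding the autocorrelation peak within a
    plausible speech pitch range (assumed here to be 50-400 Hz).
    """
    # RMS amplitude of the frame
    rms = np.sqrt(np.mean(signal ** 2))

    # Autocorrelation for non-negative lags
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]

    # Search lags corresponding to 50-400 Hz (skip lag 0)
    min_lag = int(sr / 400)
    max_lag = int(sr / 50)
    lag = min_lag + np.argmax(ac[min_lag:max_lag])
    pitch_hz = sr / lag
    return pitch_hz, rms

# Synthetic 200 Hz tone as a stand-in for one second of a voiced frame
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
pitch, rms = extract_features(tone, sr)
```

In a real system these frame-level features would be computed over short overlapping windows and fed into the representation-learning model rather than used directly.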