Speaker recognition is the task of identifying speakers from their speech. Most speaker recognition systems have two stages: the first extracts low-dimensional correlation embeddings from speech, and the second performs classification. The robustness of such a system depends mainly on the embedding extractor, which is usually pre-trained on a large-scale dataset. Because the embedding system is pre-trained, the performance of a speaker recognition model depends heavily on the domain adaptation policy, and it may degrade when adaptation is trained on inadequate data. This paper introduces a speaker recognition strategy for unlabeled data that generates clusterable embedding vectors from small, fixed-size speech frames. The unsupervised training strategy assumes that a short speech segment contains a single speaker. Based on this assumption, pairwise constraints are constructed with noise augmentation policies and used to train an AutoEmbedder architecture that generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner; we term these unsupervised vectors (u-vectors). The evaluation is conducted on two popular English speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts that speaker recognition systems face. We conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
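The pairwise constraint described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`split_frames`, `add_noise`, `build_pairs`, `snr_db`) is hypothetical. Two assumptions go beyond what the abstract states: frames drawn from different segments are treated as cannot-link pairs, and the noise augmentation is plain additive Gaussian noise at a fixed signal-to-noise ratio. The only part taken directly from the abstract is that two frames from the same short segment are assumed to share a speaker.

```python
import numpy as np


def split_frames(segment, frame_len):
    """Cut a 1-D segment into non-overlapping fixed-size frames (remainder dropped)."""
    n = len(segment) // frame_len
    return segment[: n * frame_len].reshape(n, frame_len)


def add_noise(frame, rng, snr_db=20.0):
    """Additive white Gaussian noise at a target SNR (assumed augmentation, not from the paper)."""
    power = np.mean(frame ** 2) + 1e-12
    noise_power = power / (10.0 ** (snr_db / 10.0))
    return frame + rng.normal(0.0, np.sqrt(noise_power), size=frame.shape)


def build_pairs(segments, frame_len, n_pairs, seed=0):
    """Return (left, right, labels); label 1 = must-link, 0 = cannot-link (hypothetical convention)."""
    rng = np.random.default_rng(seed)
    frames = [split_frames(np.asarray(s, dtype=float), frame_len) for s in segments]
    frames = [f for f in frames if len(f) > 0]
    if len(frames) < 2:
        raise ValueError("need at least two segments that each hold one full frame")

    left, right, labels = [], [], []
    for k in range(n_pairs):
        i = int(rng.integers(len(frames)))
        a = frames[i][rng.integers(len(frames[i]))]
        if k % 2 == 0:
            # Must-link: both frames come from the same segment (single-speaker assumption).
            b = frames[i][rng.integers(len(frames[i]))]
            label = 1
        else:
            # Cannot-link: frames from two different segments (an assumption of this sketch).
            j = (i + 1 + int(rng.integers(len(frames) - 1))) % len(frames)
            b = frames[j][rng.integers(len(frames[j]))]
            label = 0
        left.append(add_noise(a, rng))
        right.append(add_noise(b, rng))
        labels.append(label)
    return np.stack(left), np.stack(right), np.array(labels)
```

The resulting arrays could be fed to any pairwise (Siamese-style) network. The cannot-link assumption is noisy, since two segments may come from the same speaker, so a real pipeline would need to tolerate some wrongly labeled pairs.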