We introduce a method to identify speakers by computing with high-dimensional random vectors. Its strengths are simplicity and speed. With only 1.02k active parameters and a 128-minute pass through the training data, we achieve Top-1 and Top-5 scores of 31% and 52% on the VoxCeleb1 dataset of 1,251 speakers. By contrast, CNN models require several million parameters and orders of magnitude more computation for only a 2$\times$ gain in discriminative power, as measured by mutual information. An additional 92 seconds of training with Generalized Learning Vector Quantization (GLVQ) raises the scores to 48% and 67%. Once trained, the classifier processes 1 second of speech in 5.7 ms. All processing was done on standard CPU-based machines.
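To make "computing with high-dimensional random vectors" concrete, the following is a minimal sketch of one common form of such a classifier. It is not the method evaluated above: the dimensionality `DIM`, the symbolic features `"a"` to `"h"`, the labels `speaker_A` and `speaker_B`, and all function names are assumptions made for illustration. Each feature symbol gets a random bipolar vector, a sample is encoded by summing (bundling) its symbol vectors, a class prototype is the sum of its encoded training samples, and a query is assigned to the prototype with the highest cosine similarity.

```python
import math
import random

DIM = 2000  # hypothetical dimensionality, chosen only for this sketch


def random_hv(rng, dim=DIM):
    """Draw a random bipolar (+1/-1) vector."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bundle(vectors, dim=DIM):
    """Superimpose vectors by elementwise addition."""
    out = [0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            out[i] += x
    return out


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def encode(sample, item_memory):
    """Encode a sample (a list of symbols) as the bundle of its symbol vectors."""
    return bundle([item_memory[s] for s in sample])


def train(samples_by_class, item_memory):
    """Build one prototype per class by bundling its encoded samples."""
    return {
        label: bundle([encode(s, item_memory) for s in samples])
        for label, samples in samples_by_class.items()
    }


def classify(sample, prototypes, item_memory):
    """Return the label whose prototype is most similar to the encoded sample."""
    query = encode(sample, item_memory)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy data: symbols stand in for quantized acoustic features (an assumption).
rng = random.Random(0)
item_memory = {sym: random_hv(rng) for sym in "abcdefgh"}
training = {
    "speaker_A": [["a", "b", "c"], ["a", "b", "d"]],
    "speaker_B": [["e", "f", "g"], ["e", "f", "h"]],
}
prototypes = train(training, item_memory)
prediction = classify(["a", "c", "d"], prototypes, item_memory)
print(prediction)
```

The sketch relies on the fact that independent random vectors in high dimensions are nearly orthogonal, so a bundle stays similar to its constituents and dissimilar to unrelated vectors. Training is a single additive pass over the data, which is in keeping with the short training time reported above, although the actual encoding of speech features is not specified here.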