In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, by contrast, metric embeddings generated by the triplet loss are rarely used, even for classification problems. We fill this gap by showing that combining two representation learning techniques, a triplet loss-based embedding and a variant of kNN for classification in place of the cross-entropy loss, significantly improves (by 26% to 38%) the classification accuracy of convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel triplet mining approach based on phonetic similarity. We also improve on the current best published results for the Google Speech Commands dataset: V1 10+2-class classification by about 34%, achieving 98.55% accuracy; V2 10+2-class classification by about 20%, achieving 98.37% accuracy; and V2 35-class classification by over 50%, achieving 97.0% accuracy.
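As a rough illustration of the two ingredients named in the abstract, the following sketch computes a standard hinge-style triplet loss on squared Euclidean distances and then classifies embeddings by a plain majority-vote kNN. It is not the authors' implementation: the function names `triplet_loss` and `knn_predict`, the margin value, and the choice of `k` are assumptions made for illustration, ordinary kNN stands in for the kNN variant the abstract mentions, and the phonetic similarity based triplet mining is not modeled at all.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances, averaged over the batch.

    The margin value is an illustrative assumption, not taken from the paper.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-to-positive distances
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-to-negative distances
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Label each query by majority vote among its k nearest training embeddings.

    Plain kNN is used here as a stand-in for the paper's kNN variant.
    """
    # Pairwise squared distances, shape (n_query, n_train).
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest points
    preds = []
    for row in nearest:
        labels, counts = np.unique(train_labels[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)


# Toy 2-D "embeddings" for two classes (hypothetical data).
train_emb = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                      [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
query_emb = np.array([[0.2, 0.1], [5.5, 5.2]])

predictions = knn_predict(train_emb, train_labels, query_emb, k=3)
loss = triplet_loss(np.array([[0.0, 0.0]]),
                    np.array([[0.0, 1.0]]),
                    np.array([[1.0, 0.0]]))
```

In a real pipeline the embeddings would come from a trained convolutional network rather than hand-written arrays, and the loss would be minimized with an automatic differentiation framework; the NumPy version only shows the arithmetic.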