Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space correlates, at best, only moderately with phonological distance, and (2) improving performance on the word discrimination task does not necessarily yield models that better reflect phonological similarity between word forms. Our findings highlight the need to rethink the current intrinsic evaluations for AWEs.
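To make the phonological-similarity evaluation concrete, the sketch below illustrates one plausible way to measure how well embedding distances track phonological distances. It is an assumption-laden illustration, not the paper's protocol: it assumes embeddings are given as a NumPy array `awes` of shape (n_words, d), that each word has a phoneme transcription in `phones`, that cosine distance is used in the embedding space, and that length-normalized Levenshtein distance over phoneme sequences stands in for phonological distance, with Spearman's rank correlation as the summary statistic.

```python
# Minimal sketch: correlation between AWE distance and phonological distance.
# All names (`awes`, `phones`, the distance choices) are illustrative assumptions.

from itertools import combinations

import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr


def levenshtein(a, b):
    """Edit distance between two phoneme sequences (standard two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def phonological_correlation(awes, phones):
    """Spearman correlation between pairwise embedding and phonological distances."""
    emb_d, pho_d = [], []
    for i, j in combinations(range(len(phones)), 2):
        emb_d.append(cosine(awes[i], awes[j]))
        # Length-normalized phoneme edit distance as a proxy for
        # phonological dissimilarity.
        pho_d.append(levenshtein(phones[i], phones[j])
                     / max(len(phones[i]), len(phones[j])))
    rho, _ = spearmanr(emb_d, pho_d)
    return rho


# Toy usage with random embeddings and hypothetical phoneme transcriptions.
rng = np.random.default_rng(0)
awes = rng.normal(size=(3, 64))
phones = [["h", "a", "n", "t"], ["h", "u", "n", "t"], ["k", "a", "ts", "@"]]
print(phonological_correlation(awes, phones))
```

Under this setup, a high rank correlation would mean that words which are phonologically close also lie close in the embedding space; the abstract's finding of only moderate correlation corresponds to a middling value of this statistic even for the best-performing models.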