While several self-supervised approaches for learning discrete speech representations have been proposed, it is unclear how these seemingly similar approaches relate to each other. In this paper, we consider a generative model with discrete latent variables that learns a discrete representation for speech. The objective of learning the generative model is formulated as information-theoretic co-training. Besides its wide generality, the objective can be optimized with several approaches, subsuming HuBERT-like training and vector quantization for learning discrete representations. Empirically, we find that the proposed approach learns discrete representations that are highly correlated with phonetic units, more so than HuBERT-like training and vector quantization.