Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals, for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.
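The abstract does not spell out how "slowness" is imposed; a minimal sketch of one standard formulation is shown below, assuming an L2 penalty on consecutive latent differences (an illustrative assumption, not necessarily the exact loss used by SlowAEs). Penalizing change between adjacent time steps pushes the encoder toward piecewise-constant latent sequences, which change only where the input carries salient information.

```python
import numpy as np

def slowness_penalty(latents: np.ndarray) -> float:
    """Illustrative slowness penalty (assumed form, not the paper's exact loss).

    latents: array of shape (time, channels).
    Penalizing the squared difference between consecutive latents
    encourages piecewise-constant sequences, so after quantization the
    representation changes only at salient "events" in the signal.
    """
    diffs = np.diff(latents, axis=0)          # (time-1, channels)
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))
```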
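The name "run-length Transformer" suggests that piecewise-constant discrete sequences are compressed into (value, run-length) event pairs before autoregressive modelling; the sketch below illustrates only that encoding step (the RLTs' actual tokenization and architecture are not described here, so the event format is an assumption). Note how the number of events tracks how quickly the signal changes: few events where little happens, many where it changes rapidly.

```python
def run_length_encode(tokens):
    """Turn a piecewise-constant token sequence into (value, length) events.

    Event format (value, run_length) is an illustrative assumption.
    A slowly varying sequence such as [3, 3, 3, 7, 7, 1] becomes
    [(3, 3), (7, 2), (1, 1)], so representation length adapts to the
    density of change in the input.
    """
    events = []
    for t in tokens:
        if events and events[-1][0] == t:
            events[-1] = (t, events[-1][1] + 1)  # extend the current run
        else:
            events.append((t, 1))                # start a new event
    return events

print(run_length_encode([3, 3, 3, 7, 7, 1]))  # [(3, 3), (7, 2), (1, 1)]
```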