We improve on the popular conformer architecture by replacing the depthwise temporal convolutions with diagonal state space (DSS) models. DSS is a recently introduced variant of linear RNNs obtained by discretizing a linear dynamical system with a diagonal state transition matrix. DSS layers project the input sequence onto a space of orthogonal polynomials where the choice of basis functions, metric and support is controlled by the eigenvalues of the transition matrix. We compare neural transducers with either conformer or our proposed DSS-augmented transformer (DSSformer) encoders on three public corpora: Switchboard English conversational telephone speech (300 hours), Switchboard+Fisher (2000 hours), and MALACH (176 hours), a spoken archive of Holocaust survivor testimonials. On Switchboard 300/2000 hours we reach a single-model performance of 8.9%/6.7% WER, respectively, on the combined test set of the Hub5 2000 evaluation, and on MALACH we improve the WER by 7% relative over the previous best published result. In addition, we present empirical evidence suggesting that DSS layers learn damped Fourier basis functions where the attenuation coefficients are layer-specific whereas the frequency coefficients converge to almost identical linearly spaced values across all layers.
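To make the DSS mechanism concrete, the following is a minimal NumPy sketch of a diagonal SSM materialized as a convolution kernel of damped complex exponentials and applied as a causal convolution via FFT. The parameter names (`log_lambda_re`, `lambda_im`, `w`, step size `delta`) and the real-weight simplification are illustrative assumptions, not the paper's exact DSS parameterization, which uses complex mixing weights and a normalized kernel variant.

```python
import numpy as np

def dss_kernel(log_lambda_re, lambda_im, w, L, delta):
    """Materialize the length-L convolution kernel of a diagonal SSM.

    Sketch of the continuous system x'(t) = diag(lambda) x(t) + u(t),
    y(t) = w^T x(t), with eigenvalues
        lambda_i = -exp(log_lambda_re_i) + i * lambda_im_i
    (negative real part => stable, damped modes). Sampling with step
    `delta` gives kernel entries K[k] = sum_i w_i * exp(lambda_i * delta * k).
    """
    lam = -np.exp(log_lambda_re) + 1j * lambda_im    # (N,) stable eigenvalues
    k = np.arange(L)                                 # time steps 0..L-1
    # Vandermonde-style matrix of damped complex exponentials: (N, L)
    basis = np.exp(lam[:, None] * delta * k[None, :])
    return (w @ basis).real                          # (L,) real-valued kernel

def dss_layer(u, kernel):
    """Apply the SSM kernel as a causal (per-channel) convolution via FFT."""
    L = u.shape[-1]
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(kernel, n), n)[..., :L]

# Toy usage: one channel, 16 modes, sequence of length 128.
rng = np.random.default_rng(0)
K = dss_kernel(rng.normal(size=16), rng.normal(size=16), rng.normal(size=16),
               L=128, delta=0.1)
y = dss_layer(rng.normal(size=128), K)
```

In this simplified form each kernel entry is a mixture of terms e^{-r_i Δk} cos(ω_i Δk), i.e. damped Fourier basis functions whose attenuation is set by the real parts of the eigenvalues and whose frequencies by the imaginary parts, which is the structure the empirical analysis at the end of the abstract refers to.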