We propose multi-layer perceptron (MLP)-based architectures suitable for variable-length input. MLP-based architectures, recently proposed for image classification, can only be used for inputs of a fixed, pre-defined size. However, many types of data, such as acoustic signals, are naturally variable in length. We propose three approaches to extend MLP-based architectures to sequences of arbitrary length. The first uses a circular convolution applied in the Fourier domain, the second applies a depthwise convolution, and the third relies on a shift operation. We evaluate the proposed architectures on an automatic speech recognition task with the Librispeech and Tedlium2 corpora. The best proposed MLP-based architecture improves WER by 1.0/0.9% on the Librispeech dev-clean/dev-other sets and 0.9/0.5% on the test-clean/test-other sets, and by 0.8/1.1% on the Tedlium2 dev/test sets, while using 86.4% of the size of a self-attention-based architecture.
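The first approach exploits the convolution theorem: a circular convolution over the time axis can be computed as an element-wise product in the Fourier domain, so the same kernel logic applies to any sequence length. The following is a minimal NumPy sketch of that idea only; the function names and the per-channel kernel layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def circular_conv_fft(x, w):
    """Circular convolution in the Fourier domain (illustrative sketch).

    x: (T, d) feature sequence, w: (T,) kernel shared across channels.
    Returns a (T, d) array where each channel is circularly convolved
    with w, computed as an element-wise product of spectra.
    """
    T = x.shape[0]
    X = np.fft.rfft(x, n=T, axis=0)      # per-channel spectrum, (T//2+1, d)
    W = np.fft.rfft(w, n=T)[:, None]     # kernel spectrum, broadcast over d
    return np.fft.irfft(X * W, n=T, axis=0)

def circular_conv_naive(x, w):
    """Time-domain definition y[t] = sum_k w[k] * x[(t - k) mod T],
    used only to check the FFT version."""
    T = x.shape[0]
    return np.stack([sum(w[k] * x[(t - k) % T] for k in range(T))
                     for t in range(T)])
```

Because the FFT length is taken from the input itself, the same code handles sequences of any length, which is the property the abstract needs for variable-length acoustic signals.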