In this paper, we explore neural architecture search (NAS) for automatic speech recognition (ASR) systems. Following previous work in the computer vision field, the transferability of the searched architecture is the main focus of our work: the architecture search is conducted on a small proxy dataset, and the evaluation network, constructed with the searched architecture, is then evaluated on a large dataset. In particular, we propose a revised search space for speech recognition tasks which, in theory, makes it easier for the search algorithm to explore low-complexity architectures. Extensive experiments show that: (i) the architecture searched on the small proxy dataset can be transferred to the large dataset for speech recognition tasks; (ii) the architecture learned in the revised search space greatly reduces computational overhead and GPU memory usage with only mild performance degradation; and (iii) the searched architecture achieves relative improvements of more than 20% and 15% (averaged over the four test sets) on the AISHELL-2 dataset and a large (10k-hour) dataset, respectively, compared with our best hand-designed DFSMN-SAN architecture. To the best of our knowledge, this is the first report of NAS results on a large-scale dataset (up to 10k hours), indicating the promising application of NAS to industrial ASR systems.
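Although the abstract does not include code, a minimal sketch may help illustrate the DARTS-style differentiable search that this line of work typically builds on: each edge of a searched cell is a softmax-weighted mixture of candidate operations, the mixture weights (architecture parameters) are learned jointly with the network weights on the small proxy dataset, and the strongest candidate per edge is kept when constructing the evaluation network. The candidate operations, class names, and the `derive` helper below are illustrative assumptions, not the paper's actual search space.

```python
import torch
import torch.nn as nn

# Hypothetical candidate operations over (batch, channels, time) features;
# the paper's actual search space may contain different ops.
OPS = {
    "skip_connect": lambda c: nn.Identity(),
    "conv_3": lambda c: nn.Sequential(
        nn.ReLU(), nn.Conv1d(c, c, kernel_size=3, padding=1), nn.BatchNorm1d(c)
    ),
    "conv_5": lambda c: nn.Sequential(
        nn.ReLU(), nn.Conv1d(c, c, kernel_size=5, padding=2), nn.BatchNorm1d(c)
    ),
}

class MixedOp(nn.Module):
    """DARTS-style mixed operation: a softmax-weighted sum of candidates."""

    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList(op(channels) for op in OPS.values())
        # Architecture parameters (one logit per candidate op), optimized
        # on the small proxy dataset during the search phase.
        self.alpha = nn.Parameter(torch.zeros(len(OPS)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# After the search converges, each MixedOp is discretized to its strongest
# candidate; stacking the derived cells yields the evaluation network that
# is then trained on the large dataset.
def derive(op: MixedOp) -> str:
    return list(OPS.keys())[int(op.alpha.argmax())]
```

In this framing, a revised search space in the spirit of the paper would amount to restricting or biasing `OPS` toward cheaper operations, which is consistent with the reported reductions in computational overhead and GPU memory usage of the derived network.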