Several variants of the Long Short-Term Memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful fANOVA framework. In total, we summarize the results of 5400 experimental runs ($\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
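To make the components referenced above concrete, the following is a minimal NumPy sketch of a single step of the vanilla (standard) LSTM cell, marking the forget gate and the output activation function that the study identifies as most critical. The function name, argument layout, and gate ordering are illustrative assumptions, not taken from the paper's experimental code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a vanilla LSTM cell (illustrative sketch).

    x: input vector (d,); h_prev, c_prev: previous hidden and cell state (n,);
    W: weights of shape (4n, d + n); b: bias of shape (4n,).
    Gate pre-activations are stacked as [input, forget, cell, output].
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate
    f = sigmoid(z[1 * n:2 * n])   # forget gate (found to be a critical component)
    g = np.tanh(z[2 * n:3 * n])   # candidate cell update
    o = sigmoid(z[3 * n:4 * n])   # output gate
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # output activation (tanh) applied to the cell state
    return h, c
```

The studied variants modify this cell by, for example, removing individual gates or the output activation; the sketch is only meant to fix notation for the standard architecture against which they are compared.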