The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work builds on a previous approach that used Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. A phoneme confusion network generated by a phoneme recognizer is processed by a deep LSTM network, which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed as a simple dot product in the embedding space and calibrated with a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses at the word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech.
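The scoring step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins for the outputs of the two deep LSTM networks, the embedding dimension and segment count are arbitrary, and the thresholding at the end is a simple stand-in for the paper's localization procedure.

```python
import numpy as np

def sigmoid(x):
    # Calibrates a raw dot-product score into a probability of occurrence.
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical embeddings: in the paper these come from two deep LSTM
# networks; here they are random placeholders of matching dimension.
rng = np.random.default_rng(0)
segment_embeddings = rng.standard_normal((20, 64))  # 20 confusion-network segments
term_embedding = rng.standard_normal(64)            # the searched term

# Relevance score per segment: dot product in the shared embedding space,
# calibrated by a sigmoid to a per-segment probability of occurrence.
scores = segment_embeddings @ term_embedding
probs = sigmoid(scores)

# Estimate the term location from the sequence of output probabilities,
# e.g. by keeping segments whose probability exceeds a threshold.
threshold = 0.5
candidate_segments = np.flatnonzero(probs > threshold)
```

The dot product makes scoring cheap at search time: archive segments can be embedded once offline, and only the query term needs a forward pass per search.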