The paper presents a method for spoken term detection (STD) based on the Transformer architecture. We propose an encoder-encoder architecture employing two BERT-like encoders with additional modifications, including convolutional and upsampling layers, attention masking, and shared parameters. The encoders project a recognized hypothesis and a searched term into a shared embedding space, where the score of the putative hit is computed as a calibrated dot product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the proposed system outperformed a baseline method based on deep LSTMs on English and Czech STD datasets derived from the USC Shoah Foundation Visual History Archive (MALACH).
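The scoring scheme described above can be illustrated with a minimal sketch. Here the BERT-like encoder is replaced by a hypothetical stand-in (a shared linear projection with mean pooling); the dimensions, the calibration parameters `a` and `b`, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the BERT-like encoder: a single shared linear
# projection followed by mean pooling. The actual system uses Transformer
# encoders with convolutional/upsampling layers and attention masking.
D_IN, D_EMB = 16, 8
W = rng.standard_normal((D_IN, D_EMB)) / np.sqrt(D_IN)  # shared parameters

def encode(seq):
    """Project a sequence of frame/token vectors into the shared space."""
    h = seq @ W                       # (T, D_EMB)
    e = h.mean(axis=0)                # pool over the time axis
    return e / np.linalg.norm(e)      # unit-normalize the embedding

def score(hyp, term, a=5.0, b=-2.0):
    """Calibrated dot product: sigmoid(a * <e_hyp, e_term> + b).

    a and b are illustrative calibration parameters; in practice they
    would be fitted so the score behaves like a detection confidence.
    """
    s = encode(hyp) @ encode(term)
    return 1.0 / (1.0 + np.exp(-(a * s + b)))

hyp = rng.standard_normal((20, D_IN))   # recognized hypothesis (20 frames)
term = rng.standard_normal((4, D_IN))   # searched term (4 tokens)
print(f"putative hit score: {score(hyp, term):.3f}")
```

Because both inputs pass through the same parameters, the hypothesis and the term land in one embedding space, and a single threshold on the calibrated score can decide whether a putative hit is accepted.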