Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment the words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task. Models trained to predict words from either phones or speech (i.e., the opposite direction, which is the one needed to generalize to new data) yield much worse results, suggesting that attention-based segmentation is only useful in limited scenarios.
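To illustrate the general idea (not the paper's specific method), attention-based segmentation typically assigns each input frame to the output unit that attends to it most strongly, placing a boundary wherever the assignment changes. A minimal sketch, with an illustrative toy attention matrix:

```python
import numpy as np

def segment_from_attention(attention):
    """Derive boundaries from an attention matrix of shape
    (num_output_units, num_input_frames): assign each frame to the
    output unit with maximal attention weight, and place a boundary
    wherever the assigned unit index changes."""
    assignment = attention.argmax(axis=0)
    boundaries = [0]
    for t in range(1, len(assignment)):
        if assignment[t] != assignment[t - 1]:
            boundaries.append(t)
    return boundaries

# Toy example: 2 output words attending over 6 input frames.
att = np.array([
    [0.9, 0.8, 0.7, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.3, 0.8, 0.9, 0.9],
])
print(segment_from_attention(att))  # → [0, 3]
```

In practice the attention matrix comes from a trained sequence-to-sequence model, and real segmentation algorithms add smoothing or forced monotonicity; this sketch only shows the argmax-assignment core.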