Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
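To make the modeling background concrete, below is a toy sketch of the classic type-lexicon approach the abstract contrasts against: Goldwater-style unigram word segmentation under a Dirichlet process, sampled with boundary-wise Gibbs moves. This is not DP-Parse's instance-lexicon algorithm; all function names, the geometric length prior, and the hyperparameters are illustrative assumptions.

```python
import random
from collections import Counter

def base_prob(word, n_chars, p_stop=0.5):
    # Base distribution P0: geometric length prior times uniform characters.
    return ((1 - p_stop) ** (len(word) - 1)) * p_stop * (1.0 / n_chars) ** len(word)

def dp_prob(word, counts, total, alpha, n_chars):
    # CRP predictive probability of the next word token under a DP(alpha, P0).
    return (counts[word] + alpha * base_prob(word, n_chars)) / (total + alpha)

def segment(text, bounds):
    # Turn boundary indicators (True = boundary after char j) into word tokens.
    out, start = [], 0
    for j, b in enumerate(bounds):
        if b:
            out.append(text[start:j + 1])
            start = j + 1
    out.append(text[start:])
    return out

def gibbs_segment(text, alpha=1.0, iters=50, seed=0):
    # Toy Gibbs sampler: resample each potential boundary given all others.
    rng = random.Random(seed)
    n_chars = len(set(text))
    bounds = [False] * (len(text) - 1)
    for _ in range(iters):
        for i in range(len(bounds)):
            # Locate the nearest boundaries left and right of position i.
            l = i - 1
            while l >= 0 and not bounds[l]:
                l -= 1
            r = i + 1
            while r < len(bounds) and not bounds[r]:
                r += 1
            left, right = text[l + 1:i + 1], text[i + 1:r + 1]
            whole = text[l + 1:r + 1]
            # Remove the token(s) covering this region from the counts.
            cur = segment(text, bounds)
            counts = Counter(cur)
            if bounds[i]:
                counts[left] -= 1
                counts[right] -= 1
                total = len(cur) - 2
            else:
                counts[whole] -= 1
                total = len(cur) - 1
            # Compare merging into one token vs. splitting into two.
            p_merge = dp_prob(whole, counts, total, alpha, n_chars)
            p_left = dp_prob(left, counts, total, alpha, n_chars)
            counts[left] += 1  # condition the right word on the left draw
            p_right = dp_prob(right, counts, total + 1, alpha, n_chars)
            counts[left] -= 1
            p_split = p_left * p_right
            bounds[i] = rng.random() < p_split / (p_split + p_merge)
    return segment(text, bounds)
```

The `counts` table over word *types* is exactly the lexicon whose clustering errors DP-Parse avoids by tracking only word-token instances.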