Approaches to detecting perceived prominence in speech range from acoustic features designed using linguistic knowledge to features learned automatically from suprasegmental attributes such as pitch and intensity contours. In contrast, we present a system that operates directly on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment. The chosen convolutional recurrent neural network (CRNN) framework, which incorporates both word-level features and sequence information, is found to benefit from perceptually motivated SincNet filters as the first convolutional layer. We further explore, using different multi-task architectures, the benefits of the linguistic association between two prosodic events: phrase boundary and prominence. The system matches the previously reported performance, on the same dataset, of a random forest ensemble predictor trained on carefully chosen hand-crafted acoustic features. We then evaluate the possibly complementary information from hand-crafted acoustic and pre-trained lexical features.
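To make the SincNet idea concrete, the following is a minimal sketch of how one band-pass kernel of a SincNet-style first layer can be built as the difference of two low-pass sinc filters. It is an illustration only, not the configuration used in this work: the function name `sinc_bandpass`, the kernel length, the sample rate, and the cutoff frequencies are all assumptions, and in SincNet the two cutoffs are learned during training rather than fixed as they are here.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, sample_rate=16000):
    """Build one windowed band-pass kernel (hypothetical helper, fixed cutoffs in Hz)."""
    # Symmetric time axis in seconds, centred on zero.
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    # np.sinc(x) is sin(pi*x)/(pi*x), so 2*f*np.sinc(2*f*t) is an ideal
    # low-pass impulse response with cutoff f.
    low_pass_high = 2 * f_high * np.sinc(2 * f_high * t)
    low_pass_low = 2 * f_low * np.sinc(2 * f_low * t)
    # Band-pass = difference of the two low-pass filters, smoothed by a
    # Hamming window to reduce ripple from truncating the sinc.
    kernel = (low_pass_high - low_pass_low) * np.hamming(kernel_size)
    # Scale so the largest coefficient has magnitude 1.
    return kernel / np.abs(kernel).max()

# Assumed example band: 300 Hz to 3000 Hz at a 16 kHz sample rate.
kernel = sinc_bandpass(300.0, 3000.0)
```

Each such kernel is described by only two parameters (the low and high cutoffs), which is what distinguishes this kind of layer from a standard convolution whose every tap is a free parameter.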